
MAS3902

Bayesian Inference

Semester 1, 2019–20

Dr. Lee Fawcett


School of Mathematics, Statistics & Physics
Course overview

You were introduced to the Bayesian approach to statistical inference in MAS2903. That module presented statistical analysis in a very different light from the frequentist approach used in other courses. The frequentist approach bases inference on the sampling distribution of (usually unbiased) estimators; the Bayesian framework, as you may recall, combines information expressed as expert subjective opinion with experimental data. You have probably realised that the Bayesian approach has many advantages over the frequentist approach: in particular, it provides a more natural way of dealing with parameter uncertainty, and inference is far more straightforward to interpret.
Much of the work in this module will be concerned with extending the ideas presented in MAS2903 to more realistic models, with many parameters, of the kind you may encounter in real-life situations. These notes are split into four chapters:

• Chapter 1 reviews some of the key results for Bayesian inference of single param-
eter problems studied in Stage 2. It also introduces the idea of a mixture prior
distribution.

• Chapter 2 studies the case of a random sample from a normal population and
determines how to make inferences about the population mean and precision, and
about future values from the population. The Group Project is based on this
material.

• Chapter 3 contains some general results for multi-parameter problems. You will
encounter familiar concepts, such as how to represent vague prior information and
the asymptotic normal posterior distribution.

• Chapter 4 introduces Markov chain Monte Carlo techniques which have truly revo-
lutionised the use of Bayesian inference in applications. Inference proceeds by sim-
ulating realisations from the posterior distribution. The ideas will be demonstrated
using an R library specially written for the module. This material is extended in the
4th year module MAS8951: Modern Bayesian Inference.
Contents

1 Single parameter problems
  1.1 Prior and posterior distributions
  1.2 Different levels of prior knowledge
      1.2.1 Substantial prior knowledge
      1.2.2 Limited prior knowledge
      1.2.3 Vague prior knowledge
  1.3 Asymptotic posterior distribution
  1.4 Bayesian inference
      1.4.1 Estimation
      1.4.2 Prediction
  1.5 Mixture prior distributions
  1.6 Learning objectives
Chapter 1

Single parameter problems

This chapter reviews some of the key results for Bayesian inference of single parameter
problems studied in MAS2903.

1.1 Prior and posterior distributions

Suppose we have data x = (x1 , x2 , . . . , xn )T which we model using the probability (density)
function f (x|θ), which depends on a single parameter θ. Once we have observed the data,
f (x|θ) is the likelihood function for θ and is a function of θ (for fixed x) rather than of x
(for fixed θ).
Also, suppose we have prior beliefs about likely values of θ expressed by a probability
(density) function π(θ). We can combine both pieces of information using the following
version of Bayes Theorem. The resulting distribution for θ is called the posterior distri-
bution for θ as it expresses our beliefs about θ after seeing the data. It summarises all
our current knowledge about the parameter θ.
Using Bayes Theorem, the posterior probability (density) function for θ is
\[
\pi(\theta|x) = \frac{\pi(\theta)\, f(x|\theta)}{f(x)}
\]
where
\[
f(x) = \begin{cases} \displaystyle\int_\Theta \pi(\theta)\, f(x|\theta)\, d\theta & \text{if } \theta \text{ is continuous}, \\[8pt] \displaystyle\sum_{\Theta} \pi(\theta)\, f(x|\theta) & \text{if } \theta \text{ is discrete}. \end{cases}
\]
Also, as f(x) is not a function of θ, Bayes Theorem can be rewritten as
\[
\pi(\theta|x) \propto \pi(\theta) \times f(x|\theta),
\]
i.e. posterior ∝ prior × likelihood.

Example 1.1

Table 1.1 shows some data on the number of cases of foodborne botulism in England and Wales. It is believed that cases occur at random at a constant rate θ in time (a Poisson process) and so can be modelled as a random sample from a Poisson distribution with mean θ.

Year    1998  1999  2000  2001  2002  2003  2004  2005
Cases      2     0     0     0     1     0     2     1

Table 1.1: Number of cases of foodborne botulism in England and Wales, 1998–2005

An expert in the epidemiology of similar diseases gives their prior distribution for the rate θ as a Ga(2, 1) distribution, with density
\[
\pi(\theta) = \theta e^{-\theta}, \qquad \theta > 0, \tag{1.1}
\]
which has mean E(θ) = 2 and variance Var(θ) = 2. Determine the posterior distribution for θ.

Solution

The data are observations on Xi|θ ∼ Po(θ), i = 1, 2, ..., 8 (independent). Therefore, the likelihood function for θ is
\[
f(x|\theta) = \prod_{i=1}^{8} \frac{\theta^{x_i} e^{-\theta}}{x_i!}
            = \frac{\theta^{2+0+\cdots+1}\, e^{-8\theta}}{2! \times 0! \times \cdots \times 1!}
            = \frac{\theta^{6} e^{-8\theta}}{4}, \qquad \theta > 0. \tag{1.2}
\]
Bayes Theorem combines the expert opinion with the observed data, and gives the posterior density function as
\[
\pi(\theta|x) \propto \pi(\theta)\, f(x|\theta)
             \propto \theta e^{-\theta} \times \frac{\theta^{6} e^{-8\theta}}{4}
             = k\, \theta^{7} e^{-9\theta}, \qquad \theta > 0. \tag{1.3}
\]
The only continuous distribution with density of the form k θ^{g−1} e^{−hθ}, θ > 0, is the Ga(g, h) distribution. Therefore, the posterior distribution must be θ|x ∼ Ga(8, 9).
Thus the data have updated our beliefs about θ from a Ga(2, 1) distribution to a Ga(8, 9)
distribution. Plots of these distributions are given in Figure 1.1, and Table 1.2 gives a summary of the main changes induced by incorporating the data; recall that a Ga(g, h) distribution has mean g/h, variance g/h² and mode (g − 1)/h.

Figure 1.1: Prior (dashed) and posterior (solid) densities for θ
Notice that, as the mode of the likelihood function is close to that of the prior distribution,
the information in the data is consistent with that in the prior distribution. Also there
is a reduction in variability from the prior to the posterior distributions. The similarity
between the prior beliefs and the data has reduced the uncertainty we have about the
rate θ at which cases occur.
            Prior     Likelihood   Posterior
            (1.1)     (1.2)        (1.3)
Mode(θ)     1.00      0.75         0.78
E(θ)        2.00      –            0.89
SD(θ)       1.41      –            0.31

Table 1.2: Changes in beliefs about θ
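
As a quick numerical check of this conjugate update (my own illustration, not part of the original notes), the R sketch below evaluates prior × likelihood on a grid for the botulism data and compares the normalised result with the Ga(8, 9) density; the grid and variable names are arbitrary choices.

# Grid check that prior x likelihood matches the Ga(8, 9) posterior
x <- c(2, 0, 0, 0, 1, 0, 2, 1)                 # botulism counts, 1998-2005
theta <- seq(0.001, 5, by = 0.001)             # grid of theta values
prior <- dgamma(theta, shape = 2, rate = 1)    # Ga(2, 1) prior density
lik <- sapply(theta, function(t) prod(dpois(x, lambda = t)))
post <- prior * lik
post <- post / (sum(post) * 0.001)             # normalise numerically
max(abs(post - dgamma(theta, shape = 8, rate = 9)))   # tiny (numerical error only)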

Example 1.2

Consider now the general case of Example 1.1: suppose Xi |θ ∼ P o(θ), i = 1, 2, . . . , n


(independent) and our prior beliefs about θ are summarised by a Ga(g, h) distribution
(with g and h known), with density
\[
\pi(\theta) = \frac{h^g \theta^{g-1} e^{-h\theta}}{\Gamma(g)}, \qquad \theta > 0. \tag{1.4}
\]
Determine the posterior distribution for θ.

Solution

The likelihood function for θ is
\[
f(x|\theta) = \prod_{i=1}^{n} \frac{\theta^{x_i} e^{-\theta}}{x_i!}
            \propto \theta^{n\bar{x}}\, e^{-n\theta}, \qquad \theta > 0. \tag{1.5}
\]
Using Bayes Theorem, the posterior density function is
\[
\pi(\theta|x) \propto \pi(\theta)\, f(x|\theta)
             \propto \frac{h^g \theta^{g-1} e^{-h\theta}}{\Gamma(g)} \times \theta^{n\bar{x}} e^{-n\theta}, \qquad \theta > 0,
\]
i.e.
\[
\pi(\theta|x) = k\, \theta^{g+n\bar{x}-1}\, e^{-(h+n)\theta}, \qquad \theta > 0, \tag{1.6}
\]
where k is a constant that does not depend on θ. Therefore, the posterior density takes the form k θ^{G−1} e^{−Hθ}, θ > 0, and so the posterior must be a gamma distribution. Thus we have θ|x ∼ Ga(G = g + nx̄, H = h + n).

Summary:
If we have a random sample from a P o(θ) distribution and our prior beliefs about θ follow
a Ga(g, h) distribution then, after incorporating the data, our (posterior) beliefs about θ
follow a Ga(g + nx̄, h + n) distribution.
The changes in our beliefs about θ are summarised in Table 1.3, taking g ≥ 1. Notice

            Prior        Likelihood   Posterior
            (1.4)        (1.5)        (1.6)
Mode(θ)     (g − 1)/h    x̄            (g + nx̄ − 1)/(h + n)
E(θ)        g/h          –            (g + nx̄)/(h + n)
SD(θ)       √g/h         –            √(g + nx̄)/(h + n)

Table 1.3: Changes in beliefs about θ

that the posterior mean is greater than the prior mean if and only if the likelihood mode
is greater than the prior mean, that is,

E(θ|x) > E(θ) ⇐⇒ Modeθ {f (x|θ)} > E(θ).

The standard deviation of the posterior distribution is smaller than that of the prior
distribution if and only if the sample mean is not too large, that is,
\[
SD(\theta|x) < SD(\theta) \iff \text{Mode}_\theta\{f(x|\theta)\} < \left(2 + \frac{n}{h}\right) E(\theta),
\]
and this will be true in large samples.
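
Since the conjugate update only changes the hyperparameters, it reduces to a one-line calculation. The sketch below is a minimal R helper of my own (not from the notes) returning the posterior Ga(G, H) parameters, applied to the botulism data with the Ga(2, 1) prior.

# Posterior is Ga(g + n*xbar, h + n) = Ga(g + sum(x), h + length(x))
poisson.gamma.update <- function(g, h, x) {
  c(G = g + sum(x), H = h + length(x))
}
poisson.gamma.update(g = 2, h = 1, x = c(2, 0, 0, 0, 1, 0, 2, 1))
# gives G = 8, H = 9, i.e. the Ga(8, 9) posterior of Example 1.1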

Example 1.3

Suppose we have a random sample from a normal distribution. In Bayesian statistics, when
dealing with the normal distribution, the mathematics is more straightforward working
with the precision (= 1/variance) of the distribution rather than the variance itself.
So we will assume that this population has unknown mean µ but known precision τ :
Xi |µ ∼ N(µ, 1/τ ), i = 1, 2, . . . , n (independent), where τ is known. Suppose our prior
beliefs about µ can be summarised by a N(b, 1/d) distribution, with probability density function
\[
\pi(\mu) = \left(\frac{d}{2\pi}\right)^{1/2} \exp\left\{-\frac{d}{2}(\mu - b)^2\right\}. \tag{1.7}
\]
Determine the posterior distribution for µ.

Hint:
\[
d(\mu - b)^2 + n\tau(\bar{x} - \mu)^2 = (d + n\tau)\left(\mu - \frac{db + n\tau\bar{x}}{d + n\tau}\right)^2 + c,
\]
where c does not depend on µ.

Solution

The likelihood function for µ is
\[
f(x|\mu) = \prod_{i=1}^{n} \left(\frac{\tau}{2\pi}\right)^{1/2} \exp\left\{-\frac{\tau}{2}(x_i - \mu)^2\right\}
         = \left(\frac{\tau}{2\pi}\right)^{n/2} \exp\left\{-\frac{\tau}{2}\sum_{i=1}^{n}(x_i - \mu)^2\right\}.
\]
Now
\[
\begin{aligned}
\sum_{i=1}^{n}(x_i - \mu)^2 &= \sum_{i=1}^{n}(x_i - \bar{x} + \bar{x} - \mu)^2 \\
&= \sum_{i=1}^{n}\left\{(x_i - \bar{x})^2 + (\bar{x} - \mu)^2 + 2(x_i - \bar{x})(\bar{x} - \mu)\right\} \\
&= \sum_{i=1}^{n}\left\{(x_i - \bar{x})^2 + (\bar{x} - \mu)^2\right\} + 2(\bar{x} - \mu)\sum_{i=1}^{n}(x_i - \bar{x}) \\
&= \sum_{i=1}^{n}(x_i - \bar{x})^2 + n(\bar{x} - \mu)^2.
\end{aligned}
\]
Let
\[
s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad \text{so that} \qquad
\sum_{i=1}^{n}(x_i - \mu)^2 = n\left\{s^2 + (\bar{x} - \mu)^2\right\}.
\]
Therefore
\[
f(x|\mu) = \left(\frac{\tau}{2\pi}\right)^{n/2} \exp\left[-\frac{n\tau}{2}\left\{s^2 + (\bar{x} - \mu)^2\right\}\right]. \tag{1.8}
\]
Using Bayes Theorem, the posterior density function is, for µ ∈ ℝ,
\[
\begin{aligned}
\pi(\mu|x) &\propto \pi(\mu)\, f(x|\mu) \\
&\propto \left(\frac{d}{2\pi}\right)^{1/2} \exp\left\{-\frac{d}{2}(\mu - b)^2\right\}
  \times \left(\frac{\tau}{2\pi}\right)^{n/2} \exp\left[-\frac{n\tau}{2}\left\{s^2 + (\bar{x} - \mu)^2\right\}\right] \\
&\propto \exp\left[-\frac{1}{2}\left\{d(\mu - b)^2 + n\tau(\bar{x} - \mu)^2\right\}\right] \\
&\propto \exp\left[-\frac{1}{2}\left\{(d + n\tau)\left(\mu - \frac{db + n\tau\bar{x}}{d + n\tau}\right)^2 + c\right\}\right] \qquad \text{using the hint} \\
&\propto \exp\left\{-\frac{1}{2}(d + n\tau)\left(\mu - \frac{db + n\tau\bar{x}}{d + n\tau}\right)^2\right\},
\end{aligned}
\]
as c does not depend on µ. Let
\[
B = \frac{db + n\tau\bar{x}}{d + n\tau} \qquad \text{and} \qquad D = d + n\tau. \tag{1.9}
\]
Then
\[
\pi(\mu|x) = k \exp\left\{-\frac{D}{2}(\mu - B)^2\right\}, \tag{1.10}
\]
where k is a constant that does not depend on µ. Therefore, the posterior density takes the form k exp{−D(µ − B)²/2}, µ ∈ ℝ, and so the posterior distribution must be a normal distribution: we have µ|x ∼ N(B, 1/D).

Summary:
If we have a random sample from a N(µ, 1/τ ) distribution (with τ known) and our prior
beliefs about µ follow a N(b, 1/d) distribution then, after incorporating the data, our
(posterior) beliefs about µ follow a N(B, 1/D) distribution.
The changes in our beliefs about µ are summarised in Table 1.4. Notice that the posterior

               Prior     Likelihood   Posterior
               (1.7)     (1.8)        (1.10)
Mode(µ)        b         x̄            (db + nτ x̄)/(d + nτ)
E(µ)           b         –            (db + nτ x̄)/(d + nτ)
Precision(µ)   d         –            d + nτ

Table 1.4: Changes in beliefs about µ

mean is greater than the prior mean if and only if the likelihood mode (sample mean) is
greater than the prior mean, that is

E(µ|x) > E(µ) ⇐⇒ Modeµ {f (x|µ)} > E(µ).

Also, the standard deviation of the posterior distribution is smaller than that of the prior
distribution.

Example 1.4

The 18th century physicist Henry Cavendish made 23 experimental determinations of the
earth’s density, and these data (in g/cm3 ) are given below.

5.36 5.29 5.58 5.65 5.57 5.53 5.62 5.29


5.44 5.34 5.79 5.10 5.27 5.39 5.42 5.47
5.63 5.34 5.46 5.30 5.78 5.68 5.85

Suppose that Cavendish asserts that the error standard deviation of these measurements
is 0.2 g/cm3 , and assume that they are normally distributed with mean equal to the
true earth density µ. Using a normal prior distribution for µ with mean 5.41 g/cm3 and
standard deviation 0.4 g/cm3 , derive the posterior distribution for µ.

Solution

From the data we calculate x̄ = 5.4848 and s = 0.1882. Therefore, the as-
sumed standard deviation σ = 0.2 is probably okay. We also have τ = 1/0.22 , b = 5.41,
d = 1/0.42 and n = 23. Therefore, using Example 1.3, the posterior distribution is
µ|x ∼ N(B, 1/D), where
\[
B = \frac{db + n\tau\bar{x}}{d + n\tau} = \frac{5.41/0.4^2 + 23 \times 5.4848/0.2^2}{1/0.4^2 + 23/0.2^2} = 5.4840
\]
and
\[
D = d + n\tau = \frac{1}{0.4^2} + \frac{23}{0.2^2} = \frac{1}{0.0415^2}.
\]
Therefore the posterior distribution is µ|x ∼ N(5.484, 0.0415²) and is shown in Figure 1.2.

The actual mean density of the earth is 5.515 g/cm³ (Wikipedia). We can determine the (posterior) probability that the mean density is within 0.1 of this value as follows. The posterior distribution is µ|x ∼ N(5.484, 0.0415²) and so
\[
Pr(5.415 < \mu < 5.615\,|\,x) = 0.9510,
\]
calculated using the R command pnorm(5.615,5.484,0.0415)-pnorm(5.415,5.484,0.0415).

Without the data, the only basis for determining the earth's density is via the prior distribution. Here the prior distribution is µ ∼ N(5.4, 0.4²) and so the (prior) probability that the mean density is within 0.1 of the (now known) true value is
\[
Pr(5.415 < \mu < 5.615) = 0.1896,
\]
calculated using the R command pnorm(5.615,5.4,0.4)-pnorm(5.415,5.4,0.4).


Figure 1.2: Prior (dashed) and posterior (solid) densities for the earth’s density
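
The whole calculation can be reproduced in a few lines of R. The sketch below (variable names are my own) computes B and D from the Cavendish data and then the posterior probability quoted above.

# Normal data with known precision, normal prior: posterior N(B, 1/D)
x <- c(5.36, 5.29, 5.58, 5.65, 5.57, 5.53, 5.62, 5.29,
       5.44, 5.34, 5.79, 5.10, 5.27, 5.39, 5.42, 5.47,
       5.63, 5.34, 5.46, 5.30, 5.78, 5.68, 5.85)
n <- length(x); xbar <- mean(x)
tau <- 1 / 0.2^2                       # known error precision
b <- 5.41; d <- 1 / 0.4^2              # prior mean and precision
B <- (d * b + n * tau * xbar) / (d + n * tau)
D <- d + n * tau
c(B = B, sd = 1 / sqrt(D))             # about 5.484 and 0.0415
pnorm(5.615, B, 1 / sqrt(D)) - pnorm(5.415, B, 1 / sqrt(D))   # about 0.951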

1.2 Different levels of prior knowledge

1.2.1 Substantial prior knowledge

We have substantial prior information for θ when the prior distribution dominates the likelihood, so that the posterior distribution is essentially unchanged from the prior, that is, π(θ|x) ≈ π(θ).
When we have substantial prior information there can be some difficulties:

1. the intractability of the mathematics in deriving the posterior distribution, though with modern computing facilities this is less of a problem;

2. the practical formulation of the prior distribution: coherently specifying prior beliefs in the form of a probability distribution is far from straightforward.

1.2.2 Limited prior knowledge

When prior information about θ is limited, the pragmatic approach is to choose a distri-
bution which makes the Bayes updating from prior to posterior mathematically straight-
forward, and use what prior information is available to determine the parameters of this
distribution. For example

• Poisson random sample, Gamma prior distribution −→ Gamma posterior distribution

• Normal random sample (known variance), Normal prior distribution −→ Normal posterior distribution

In these examples, the prior distribution and the posterior distribution come from the
same family. This leads us to the following definition.

Definition 1.1

Suppose that data x are to be observed with distribution f (x|θ). A family F of prior
distributions for θ is said to be conjugate to f (x|θ) if for every prior distribution π(θ) ∈ F,
the posterior distribution π(θ|x) is also in F.

Notice that the conjugate family depends crucially on the model chosen for the data x.
For example, the only family conjugate to the model “random sample from a Poisson
distribution” is the Gamma family.

1.2.3 Vague prior knowledge

If we have very little or no prior information about the model parameters θ, we must
still choose a prior distribution in order to operate Bayes Theorem. Obviously, it would
be sensible to choose a prior distribution which is not concentrated about any particular
value, that is, one with a very large variance. In particular, most of the information
about θ will be passed through to the posterior distribution via the data, and so we have
π(θ|x) ∼ f (x|θ).
We represent vague prior knowledge by using a prior distribution which is conjugate to
the model for x and which is as diffuse as possible, that is, has as large a variance as
possible.

Example 1.5

Suppose we have a random sample from a N(µ, 1/τ ) distribution (with τ known). De-
termine the posterior distribution assuming a vague prior for µ.

Solution

The conjugate prior distribution is a normal distribution. We have already seen


that if the prior is µ ∼ N(b, 1/d) then the posterior distribution is µ|x ∼ N(B, 1/D)
where
\[
B = \frac{db + n\tau\bar{x}}{d + n\tau} \qquad \text{and} \qquad D = d + n\tau.
\]
If we now make our prior knowledge vague about µ by letting the prior variance tend to
infinity (d → 0), we obtain

B → x̄ and D → nτ.

Therefore, assuming vague prior knowledge for µ results in a N{x̄, 1/(nτ )} posterior
distribution.
Notice that the posterior mean is the sample mean (the likelihood mode) and that the
posterior variance 1/(nτ ) → 0 as n → ∞.

Example 1.6

Suppose we have a random sample from a Poisson distribution, that is, Xi |θ ∼ P o(θ),
i = 1, 2, . . . , n (independent). Determine the posterior distribution assuming a vague
prior for θ.

Solution

The conjugate prior distribution is a Gamma distribution. Recall that a Ga(g, h) dis-
tribution has mean m = g/h and variance v = g/h2 . Rearranging these formulae we
obtain
\[
g = \frac{m^2}{v} \qquad \text{and} \qquad h = \frac{m}{v}.
\]
Clearly g → 0 and h → 0 as v → ∞ (for fixed m). We have seen how taking a Ga(g, h)
prior distribution results in a Ga(g + nx̄, h + n) posterior distribution. Therefore, taking
a vague prior distribution will give a Ga(nx̄, n) posterior distribution.
Note that the posterior mean is x̄ (the likelihood mode) and that the posterior variance x̄/n → 0 as n → ∞.

1.3 Asymptotic posterior distribution

If we have a statistical model f(x|θ) for data x = (x1, x2, ..., xn)ᵀ, together with a prior distribution π(θ) for θ, then
\[
\sqrt{J(\hat{\theta})}\,(\theta - \hat{\theta})\,\big|\,x \;\xrightarrow{D}\; N(0, 1) \qquad \text{as } n \to \infty,
\]
where θ̂ is the likelihood mode and J(θ) is the observed information
\[
J(\theta) = -\frac{\partial^2}{\partial\theta^2} \log f(x|\theta).
\]
This means that, with increasing amounts of data, the posterior distribution looks more and more like a normal distribution. The result also gives us a useful approximation to the posterior distribution for θ when n is large:
\[
\theta|x \sim N\{\hat{\theta},\, J(\hat{\theta})^{-1}\} \qquad \text{approximately}.
\]
Note that this limiting result is similar to one used in Frequentist statistics for the distribution of the maximum likelihood estimator, namely
\[
\sqrt{I(\theta)}\,(\hat{\theta} - \theta) \;\xrightarrow{D}\; N(0, 1) \qquad \text{as } n \to \infty,
\]
where Fisher's information I(θ) is the expected value of the observed information, the expectation being taken over the distribution of X|θ, that is, I(θ) = E_{X|θ}[J(θ)]. You may also have seen this result written as an approximation to the distribution of the maximum likelihood estimator in large samples, namely
\[
\hat{\theta} \sim N\{\theta,\, I(\theta)^{-1}\} \qquad \text{approximately}.
\]
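
To see this result in action, the sketch below (my own illustration, with arbitrary simulation settings) uses a Poisson random sample, for which θ̂ = x̄ and J(θ̂) = n/x̄, and overlays the exact Ga(g + nx̄, h + n) posterior from Example 1.2 on the normal approximation N{θ̂, J(θ̂)⁻¹}.

# Exact posterior versus asymptotic normal approximation (Poisson sample)
set.seed(1)
n <- 100
x <- rpois(n, lambda = 2)              # simulated data, true rate 2
g <- 2; h <- 1                         # Ga(2, 1) prior, as in Example 1.1
theta.hat <- mean(x)                   # likelihood mode
J.hat <- n / theta.hat                 # observed information at theta.hat
theta <- seq(1, 3, by = 0.01)
plot(theta, dgamma(theta, g + sum(x), h + n), type = "l", ylab = "density")
lines(theta, dnorm(theta, theta.hat, sqrt(1 / J.hat)), lty = 2)   # approximation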

Example 1.7

Suppose we have a random sample from a N(µ, 1/τ ) distribution (with τ known). De-
termine the asymptotic posterior distribution for µ.
Recall that
\[
f(x|\mu) = \left(\frac{\tau}{2\pi}\right)^{n/2} \exp\left\{-\frac{\tau}{2}\sum_{i=1}^{n}(x_i - \mu)^2\right\},
\]
and therefore
\[
\begin{aligned}
\log f(x|\mu) &= \frac{n}{2}\log\tau - \frac{n}{2}\log(2\pi) - \frac{\tau}{2}\sum_{i=1}^{n}(x_i - \mu)^2 \\
\Rightarrow\; \frac{\partial}{\partial\mu} \log f(x|\mu) &= -\frac{\tau}{2}\sum_{i=1}^{n} \{-2(x_i - \mu)\} = \tau\sum_{i=1}^{n}(x_i - \mu) = n\tau(\bar{x} - \mu) \\
\Rightarrow\; \frac{\partial^2}{\partial\mu^2} \log f(x|\mu) &= -n\tau
\quad\Rightarrow\quad J(\mu) = -\frac{\partial^2}{\partial\mu^2} \log f(x|\mu) = n\tau.
\end{aligned}
\]

Solution

We have
\[
\frac{\partial}{\partial\mu} \log f(x|\mu) = 0 \;\Longrightarrow\; \hat{\mu} = \bar{x}
\;\Longrightarrow\; J(\hat{\mu}) = n\tau
\;\Longrightarrow\; J(\hat{\mu})^{-1} = \frac{1}{n\tau}.
\]
Therefore, for large n, the (approximate) posterior distribution for µ is
\[
\mu|x \sim N\left(\bar{x},\, \frac{1}{n\tau}\right).
\]
Here the asymptotic posterior distribution is the same as the posterior distribution under vague prior knowledge.

1.4 Bayesian inference

The posterior distribution π(θ|x) summarises all our information about θ to date. How-
ever, sometimes it is helpful to reduce this distribution to a few key summary measures.

1.4.1 Estimation

Point estimates

There are many useful summaries for a typical value of a random variable with a particular
distribution; for example, the mean, mode and median. The mode is used more often as
a summary than is the case in frequentist statistics.

Confidence intervals/regions

A more useful summary of the posterior distribution is one which also reflects its variation.
For example, a 100(1 − α)% Bayesian confidence interval for θ is any region Cα that
satisfies P r (θ ∈ Cα |x) = 1 − α. If θ is a continuous quantity with posterior probability
density function π(θ|x) then
\[
\int_{C_\alpha} \pi(\theta|x)\, d\theta = 1 - \alpha.
\]

The usual correction is made for discrete θ, that is, we take the largest region Cα such
that P r (θ ∈ Cα |x) ≤ 1 − α. Bayesian confidence intervals are sometimes called credible
regions or plausible regions. Clearly these intervals are not unique, since there will be
many intervals with the correct probability coverage for a given posterior distribution.
A 100(1 − α)% highest density interval (HDI) for θ is the region

Cα = {θ : π(θ|x) ≥ γ}

where γ is chosen so that P r (θ ∈ Cα |x) = 1 − α. This region is sometimes called


a most plausible Bayesian confidence interval. If the posterior distribution has many
modes then it is possible that the HDI will be the union of several disjoint regions.
Also, if the posterior distribution is unimodal (has one mode) and symmetric about its
mean then the HDI is an equi-tailed interval, that is, takes the form Cα = (a, b), where
P r (θ < a|x) = P r (θ > b|x) = α/2; see Figure 1.3.

Figure 1.3: Construction of an HDI for a symmetric posterior density

Interpretation of confidence intervals/regions

Suppose CB is a 95% Bayesian confidence interval for θ and CF is a 95% frequentist


confidence interval for θ. These intervals do not have the same interpretation:
• the probability that CB contains θ is 0.95;

• the probability that CF contains θ is either 0 or 1 — since θ does not have a


(non-degenerate) probability distribution;

• the interval CF covers the true value θ on 95% of occasions — in repeated appli-
cations of the formula.

Example 1.8

Suppose we have a random sample x = (x1 , x2 , . . . , xn )T from a N(µ, 1/τ ) distribution


(where τ is known). We have seen that, assuming vague prior knowledge, the posterior
distribution is µ|x ∼ N{x̄, 1/(nτ )}. Determine the 100(1 − α)% HDI for µ.

Solution

This distribution has a symmetric bell shape and so the HDI is an equi-tailed
interval Cα = (a, b) with P r (µ < a|x) = α/2 and P r (µ > b|x) = α/2, that is,
\[
a = \bar{x} - \frac{z_{\alpha/2}}{\sqrt{n\tau}} \qquad \text{and} \qquad b = \bar{x} + \frac{z_{\alpha/2}}{\sqrt{n\tau}},
\]

where z_α is the upper α-quantile of the N(0, 1) distribution. For example, the 95% HDI for µ is
\[
\left(\bar{x} - \frac{1.96}{\sqrt{n\tau}},\; \bar{x} + \frac{1.96}{\sqrt{n\tau}}\right).
\]

Note that this interval is numerically identical to the 95% frequentist confidence interval
for the (population) mean of a normal random sample with known variance. However,
the interpretation is very different.

Example 1.9

Recall Example 1.1 on the number of cases of foodborne botulism in England and
Wales. The data were modelled as a random sample from a Poisson distribution with
mean θ. Using a Ga(2, 1) prior distribution, we found the posterior distribution to be
θ|x ∼ Ga(8, 9). This posterior density is shown in Figure 1.4. Determine the 100(1−α)%
HDI for θ.
Figure 1.4: Posterior density for θ

Solution

The HDI must take the form Cα = (a, b) if it is to include the values of θ with
the highest probability density. Suppose that F (·) and f (·) are the posterior distribution
and density functions. Then the end-points a and b must satisfy

P r (a < θ < b|x) = F (b) − F (a) = 1 − α



and

π(θ = a|x) − π(θ = b|x) = f (a) − f (b) = 0.

Unfortunately, there is no simple analytical solution to these equations and so numerical


methods have to be employed to determine a and b. However, if we have the quantile
function F −1 (·) for θ|x then we can find the solution by noticing that we can write b in
terms of a:
b = F −1 {1 − α + F (a)},
for 0 < a < F⁻¹(α). Therefore, we can determine the correct choice of a by minimising the function
\[
g(a) = \left(f(a) - f\left[F^{-1}\{1 - \alpha + F(a)\}\right]\right)^2
\]
within the range 0 < a < F⁻¹(α). The values of a and b can be determined using the optimizer function optimize in R, as in the sketch below.
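
As a sketch of how this works for the Ga(8, 9) posterior of this example (the function and variable names below are mine, not from the notes), we can minimise g(a) with optimize and then recover b from the quantile function:

# Numerical HDI for the Ga(8, 9) posterior via the squared-difference criterion
alpha <- 0.05
dpost <- function(theta) dgamma(theta, 8, 9)    # posterior density f
ppost <- function(theta) pgamma(theta, 8, 9)    # posterior distribution function F
qpost <- function(p) qgamma(p, 8, 9)            # posterior quantile function
g <- function(a) (dpost(a) - dpost(qpost(1 - alpha + ppost(a))))^2
a <- optimize(g, interval = c(0, qpost(alpha)))$minimum
b <- qpost(1 - alpha + ppost(a))
c(a, b)    # approximately (0.330, 1.515), agreeing with hdiGamma below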

The R package nclbayes contains functions to determine the HDI for several distribu-
tions. The function for the Gamma distribution is hdiGamma and we can calculate the
95% HDI for the Ga(8, 9) posterior distribution by using the commands

library(nclbayes)
hdiGamma(p=0.95,a=8,b=9)

Taking 1 − α = 0.95 and using such R code gives a = 0.3304362 and b = 1.5146208.
To check this answer, R gives P r (a < θ < b|x) = 0.95, π(θ = b|x) = 0.1877215 and
π(θ = a|x) = 0.1877427. Thus the 95% HDI is (0.3304362, 1.514621).
The package also has functions hdiBeta for the Beta distribution and hdiInvchi for the
Inv-Chi distribution (introduced in Chapter 2).

1.4.2 Prediction

Much of statistical inference (both Frequentist and Bayesian) is aimed towards making
statements about a parameter θ. Often the inferences are used as a yardstick for sim-
ilar future experiments. For example, we may want to predict the outcome when the
experiment is performed again.
Clearly there will be uncertainty about the future outcome of an experiment. Suppose
this future outcome Y is described by a probability (density) function f (y |θ). There are
several ways we could make inferences about what values of Y are likely. For example, if
we have an estimate θ̂ of θ we might base our inferences on f (y |θ = θ̂). Obviously this
is not the best we can do, as such inferences ignore the fact that it is very unlikely that
θ = θ̂.
Implicit in the Bayesian framework is the concept of the predictive distribution. This
distribution describes how likely are different outcomes of a future experiment. The
predictive probability (density) function is calculated as
\[
f(y|x) = \int_\Theta f(y|\theta)\, \pi(\theta|x)\, d\theta
\]
when θ is a continuous quantity. From this equation, we can see that the predictive distribution is formed by weighting the distribution of the future outcome under each possible value of θ, f(y|θ), by how likely we believe that value is to occur, π(θ|x).
If the true value of θ were known, say θ0 , then any prediction can do no better than one
based on f (y |θ = θ0 ). However, as (generally) θ is unknown, the predictive distribution
is used as the next best alternative.
We can use the predictive distribution to provide a useful range of plausible values for the
outcome of a future experiment. This prediction interval is similar to a HDI interval. A
100(1 − α)% prediction interval for Y is the region Cα = {y : f (y |x) ≥ γ} where γ is
chosen so that P r (Y ∈ Cα |x) = 1 − α.

Example 1.10

Recall Example 1.1 on the number of cases of foodborne botulism in England and Wales.
The data for 1998–2005 were modelled by a Poisson distribution with mean θ. Using
a Ga(2, 1) prior distribution, we found the posterior distribution to be θ|x ∼ Ga(8, 9).
Determine the predictive distribution for the number of cases for the following year (2006).

Solution

Suppose the number of cases in 2006 is Y, with Y|θ ∼ Po(θ). The predictive probability function of Y is, for y = 0, 1, ...,
\[
\begin{aligned}
f(y|x) &= \int_\Theta f(y|\theta)\, \pi(\theta|x)\, d\theta \\
&= \int_0^\infty \frac{\theta^y e^{-\theta}}{y!} \times \frac{9^8\, \theta^7 e^{-9\theta}}{\Gamma(8)}\, d\theta \\
&= \frac{9^8}{y!\,\Gamma(8)} \int_0^\infty \theta^{y+7} e^{-10\theta}\, d\theta \\
&= \frac{9^8}{y!\,\Gamma(8)} \times \frac{\Gamma(y+8)}{10^{y+8}} \\
&= \frac{(y+7)!}{y!\,7!} \times 0.9^8 \times 0.1^y \\
&= \binom{y+7}{7} \times 0.9^8 \times 0.1^y.
\end{aligned}
\]

You may not recognise this probability function but it is related to that of a negative
binomial distribution. Suppose Z ∼ NegBin(r, p) with probability function
\[
Pr(Z = z) = \binom{z-1}{r-1}\, p^r (1-p)^{z-r}, \qquad z = r, r+1, \ldots.
\]
Then W = Z − r has probability function
\[
Pr(W = w) = Pr(Z = w + r) = \binom{w+r-1}{r-1}\, p^r (1-p)^{w}, \qquad w = 0, 1, \ldots.
\]

This is the same probability function as our predictive probability function, with r = 8
and p = 0.9. Therefore Y |x ∼ NegBin(8, 0.9) − 8. Note that, unfortunately R also
calls the distribution of W a negative binomial distribution with parameters r and p. To
distinguish between this distribution and the NegBin(r, p) distribution used above, we
shall denote the distribution of W as a NegBinR(r, p) distribution: it has mean r(1 − p)/p and variance r(1 − p)/p². Thus Y|x ∼ NegBinR(8, 0.9).

We can compare this predictive distribution with a naive predictive distribution based on
an estimate of θ. Here we shall base our naive predictive distribution on the maximum

likelihood estimate θ̂ = 0.75, that is, use the distribution Y |θ = θ̂ ∼ P o(0.75). Thus,
the naive predictive probability function is

\[
f(y|\theta = \hat{\theta}) = \frac{0.75^y\, e^{-0.75}}{y!}, \qquad y = 0, 1, \ldots.
\]

Numerical values for the predictive and naive predictive probability functions are given in
Table 1.5.

         correct     naive
y        f(y|x)      f(y|θ = θ̂)
0        0.430       0.472
1        0.344       0.354
2        0.155       0.133
3        0.052       0.033
4        0.014       0.006
5        0.003       0.001
≥6       0.005       0.002

Table 1.5: Predictive and naive predictive probability functions

Again, the naive predictive distribution is a predictive distribution which, instead of using the correct posterior distribution, uses a degenerate posterior distribution π*(θ|x) which essentially allows only one value: Pr_{π*}(θ = 0.75|x) = 1 and SD_{π*}(θ|x) = 0. Note that the correct posterior standard deviation of θ is SD_π(θ|x) = √8/9 = 0.314. Using a degenerate posterior distribution results in the naive predictive distribution having too small a standard deviation:
\[
SD(Y|x) = \begin{cases} 0.994 & \text{using the correct } \pi(\theta|x), \\ 0.866 & \text{using the naive } \pi^*(\theta|x), \end{cases}
\]
these values being calculated from the NegBinR(8, 0.9) and Po(0.75) distributions.
Using the numerical table of predictive probabilities, we can see that {0, 1, 2} is a 92.9%
prediction set/interval. This is to be contrasted with the more “optimistic” calculation
using the naive predictive distribution which shows that {0, 1, 2} is a 95.9% prediction
set/interval.
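
The two columns of Table 1.5 can be reproduced with dnbinom and dpois; the sketch below (my own, with names chosen for clarity) uses R's parameterisation of the negative binomial, i.e. the NegBinR(r, p) form above.

# Correct predictive (posterior Ga(8, 9)) and naive (theta = 0.75) probabilities
y <- 0:5
pred <- dnbinom(y, size = 8, prob = 0.9)    # NegBinR(8, 0.9) predictive
naive <- dpois(y, lambda = 0.75)            # Po(0.75) plug-in predictive
round(cbind(y, pred, naive), 3)
sum(dnbinom(0:2, size = 8, prob = 0.9))     # 0.929: the {0, 1, 2} prediction set
sum(dpois(0:2, lambda = 0.75))              # 0.959 under the naive distribution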

Candidate’s formula

In the previous example, a non-trivial integral had to be evaluated. However, when the
past data x and future data y are independent (given θ) and we use a conjugate prior
distribution, another (easier) method can be used to determine the predictive distribution.

Using Bayes Theorem, the posterior density for θ given x and y is
\[
\begin{aligned}
\pi(\theta|x, y) &= \frac{\pi(\theta)\, f(x, y|\theta)}{f(x, y)} \\
&= \frac{\pi(\theta)\, f(x|\theta)\, f(y|\theta)}{f(x)\, f(y|x)} \qquad \text{since } X \text{ and } Y \text{ are independent given } \theta \\
&= \frac{\pi(\theta|x)\, f(y|\theta)}{f(y|x)}.
\end{aligned}
\]
Rearranging, we obtain
\[
f(y|x) = \frac{f(y|\theta)\, \pi(\theta|x)}{\pi(\theta|x, y)}.
\]
This is known as Candidate’s formula. The right-hand-side of this equation looks as if it
depends on θ but, in fact, any terms in θ will be cancelled between the numerator and
denominator.

Example 1.11

Rework Example 1.10 using Candidate’s formula to determine the number of cases in
2006.

Solution

Let Y denote the number of cases in 2006. We know that θ|x ∼ Ga(8, 9) and Y|θ ∼ Po(θ). Using Example 1.2 we obtain
\[
\theta|x, y \sim Ga(8 + y,\, 10).
\]
Therefore the predictive probability function of Y is, for y = 0, 1, ...,
\[
\begin{aligned}
f(y|x) &= \frac{f(y|\theta)\, \pi(\theta|x)}{\pi(\theta|x, y)} \\
&= \frac{\dfrac{\theta^y e^{-\theta}}{y!} \times \dfrac{9^8\, \theta^7 e^{-9\theta}}{\Gamma(8)}}{\dfrac{10^{8+y}\, \theta^{7+y} e^{-10\theta}}{\Gamma(8+y)}} \\
&= \frac{\Gamma(8+y)}{y!\,\Gamma(8)} \times \frac{9^8}{10^{8+y}} \\
&= \frac{(y+7)!}{y!\,7!} \times 0.9^8 \times 0.1^y \\
&= \binom{y+7}{7} \times 0.9^8 \times 0.1^y.
\end{aligned}
\]

1.5 Mixture prior distributions

Sometimes prior beliefs cannot be adequately represented by a simple distribution, for ex-
ample, a normal distribution or a beta distribution. In such cases, mixtures of distributions
can be useful.

Example 1.12

Investigations into infants suffering from severe idiopathic respiratory distress syndrome
have shown that whether the infant survives may be related to their weight at birth.
Suppose that you are interested in developing a prior distribution for the mean birth
weight µ of such infants. You might have a normal N(2.3, 0.52²) prior distribution for the mean birth weight (in kg) of infants who survive and a normal N(1.7, 0.66²) prior
distribution for infants who die. If you believe that the proportion of infants that survive is
0.6, what is your prior distribution of birth weights of infants suffering from this syndrome?

Solution

Let T = 1, 2 denote whether the infant survives or dies. Then the information above tells us
\[
\mu|T=1 \sim N(2.3,\, 0.52^2) \qquad \text{and} \qquad \mu|T=2 \sim N(1.7,\, 0.66^2).
\]
In terms of density functions, we have
\[
\pi(\mu|T=1) = \phi(\mu\,|\,2.3,\, 0.52^2) \qquad \text{and} \qquad \pi(\mu|T=2) = \phi(\mu\,|\,1.7,\, 0.66^2),
\]
where φ(·|a, b²) is the normal N(a, b²) density function.

The prior distribution of birth weights of infants suffering from this syndrome is the (marginal) distribution of µ. Using the Law of Total Probability, the marginal density of µ is
\[
\begin{aligned}
\pi(\mu) &= Pr(T=1) \times \pi(\mu|T=1) + Pr(T=2) \times \pi(\mu|T=2) \\
&= 0.6\, \phi(\mu\,|\,2.3,\, 0.52^2) + 0.4\, \phi(\mu\,|\,1.7,\, 0.66^2).
\end{aligned}
\]
We write this as
\[
\mu \sim 0.6\, N(2.3,\, 0.52^2) + 0.4\, N(1.7,\, 0.66^2).
\]

This prior distribution is a mixture of two normal distributions. Figure 1.5 shows the
overall (mixture) prior distribution π(µ) and the “component” distributions describing
prior beliefs about the mean weights of those who survive and those who die. Notice that,
in this example, although the mixture distribution is a combination of two distributions,
each with one mode, this mixture distribution has only one mode. Also, although the
component distributions are symmetric, the mixture distribution is not symmetric.
Figure 1.5: Plot of the mixture density (solid) with its component densities (survive –
dashed; die – dotted)
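
A mixture density like this is straightforward to evaluate directly. The short sketch below (my own illustration) plots the mixture density together with its weighted components, much as in Figure 1.5.

# Mixture prior for mean birth weight: 0.6 N(2.3, 0.52^2) + 0.4 N(1.7, 0.66^2)
mu <- seq(-1, 4, by = 0.01)
mix <- 0.6 * dnorm(mu, 2.3, 0.52) + 0.4 * dnorm(mu, 1.7, 0.66)
plot(mu, mix, type = "l", xlab = expression(mu), ylab = "density")
lines(mu, 0.6 * dnorm(mu, 2.3, 0.52), lty = 2)   # survive component (weighted)
lines(mu, 0.4 * dnorm(mu, 1.7, 0.66), lty = 3)   # die component (weighted)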

Definition 1.2

A mixture of the distributions π_i(θ) with weights p_i (i = 1, 2, ..., m) has probability (density) function
\[
\pi(\theta) = \sum_{i=1}^{m} p_i\, \pi_i(\theta). \tag{1.11}
\]

Figure 1.6 contains a plot of two quite different mixture distributions. One mixture
distribution has a single mode and the other has two modes. In general, a mixture
distribution whose m component distributions each have a single mode will have at most
m modes.


Figure 1.6: Plot of two mixture densities: solid is 0.6N(1, 1) + 0.4N(2, 1); dashed is
0.9 Exp(1) + 0.1 N(2, 0.25²)

In order for a mixture distribution to be proper, we must have
\[
1 = \int_\Theta \pi(\theta)\, d\theta
  = \int_\Theta \sum_{i=1}^{m} p_i\, \pi_i(\theta)\, d\theta
  = \sum_{i=1}^{m} p_i \int_\Theta \pi_i(\theta)\, d\theta
  = \sum_{i=1}^{m} p_i,
\]
that is, the sum of the weights must be one.

We can calculate the mean and variance of a mixture distribution as follows. We will assume, for simplicity, that θ is a scalar. Let E_i(θ) and Var_i(θ) be the mean and variance of the distribution for θ in component i, that is,
\[
E_i(\theta) = \int_\Theta \theta\, \pi_i(\theta)\, d\theta \qquad \text{and} \qquad Var_i(\theta) = \int_\Theta \{\theta - E_i(\theta)\}^2\, \pi_i(\theta)\, d\theta.
\]

It can be shown that the mean of the mixture distribution is
\[
E(\theta) = \sum_{i=1}^{m} p_i\, E_i(\theta). \tag{1.12}
\]
We also have
\[
E(\theta^2) = \sum_{i=1}^{m} p_i\, E_i(\theta^2)
            = \sum_{i=1}^{m} p_i \left\{ Var_i(\theta) + E_i(\theta)^2 \right\}, \tag{1.13}
\]
from which we can calculate the variance of the mixture distribution using
\[
Var(\theta) = E(\theta^2) - E(\theta)^2.
\]
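
Equations (1.12) and (1.13) translate directly into a small helper function. The sketch below (function name my own) computes the mean and standard deviation of a mixture from its weights and component moments, and reproduces the prior summaries of Example 1.13 below (mean 0.9, standard deviation 0.574).

# Mean and SD of a mixture from component means/variances, using (1.12)-(1.13)
mixture.summary <- function(p, means, vars) {
  m1 <- sum(p * means)               # E(theta), equation (1.12)
  m2 <- sum(p * (vars + means^2))    # E(theta^2), equation (1.13)
  c(mean = m1, sd = sqrt(m2 - m1^2))
}
# Prior of Example 1.13: 0.6 Ga(5, 10) + 0.4 Ga(15, 10)
mixture.summary(p = c(0.6, 0.4), means = c(5, 15) / 10, vars = c(5, 15) / 100)
# mean 0.9, sd approximately 0.574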

Combining a mixture prior distribution with data x using Bayes Theorem produces the posterior density
\[
\pi(\theta|x) = \frac{\pi(\theta)\, f(x|\theta)}{f(x)}
             = \sum_{i=1}^{m} \frac{p_i\, \pi_i(\theta)\, f(x|\theta)}{f(x)}, \tag{1.14}
\]
where f(x) is a constant with respect to θ. Now if the prior density were π_i(θ) (instead of the mixture distribution), using Bayes Theorem, the posterior density would be
\[
\pi_i(\theta|x) = \frac{\pi_i(\theta)\, f(x|\theta)}{f_i(x)},
\]
where f_i(x), i = 1, 2, ..., m, are constants with respect to θ, that is, π_i(θ) f(x|θ) = f_i(x) π_i(θ|x). Substituting this into (1.14) gives
\[
\pi(\theta|x) = \sum_{i=1}^{m} \frac{p_i\, f_i(x)}{f(x)}\, \pi_i(\theta|x).
\]
Thus the posterior distribution is a mixture distribution of component distributions π_i(θ|x) with weights p_i* = p_i f_i(x)/f(x). Now
\[
\sum_{i=1}^{m} p_i^* = 1 \quad\Rightarrow\quad \sum_{i=1}^{m} \frac{p_i\, f_i(x)}{f(x)} = 1 \quad\Rightarrow\quad f(x) = \sum_{i=1}^{m} p_i\, f_i(x)
\]
and so
\[
p_i^* = \frac{p_i\, f_i(x)}{\sum_{j=1}^{m} p_j\, f_j(x)}, \qquad i = 1, 2, \ldots, m.
\]
Hence, combining data x with a mixture prior distribution (p_i, π_i(θ)) produces a posterior mixture distribution (p_i*, π_i(θ|x)). The effect of introducing the data is to “update” the mixture weights (p_i → p_i*) and the component distributions (π_i(θ) → π_i(θ|x)).

Example 1.13

Suppose we have a random sample of size 20 from an exponential distribution, that is,
Xi |θ ∼ Exp(θ), i = 1, 2, . . . , 20 (independent). Also suppose that the prior distribution
for θ is the mixture distribution

θ ∼ 0.6 Ga(5, 10) + 0.4 Ga(15, 10),

as shown in Figure 1.7. Here the component distributions are π1 (θ) = Ga(5, 10) and
π2 (θ) = Ga(15, 10), with weights p1 = 0.6 and p2 = 0.4.

Figure 1.7: Plot of the mixture prior density

Using (1.12), the prior mean is
\[
E(\theta) = 0.6 \times \frac{5}{10} + 0.4 \times \frac{15}{10} = 0.9
\]
and, using (1.13), the prior second moment for θ is
\[
E(\theta^2) = 0.6 \times \frac{5 \times 6}{10^2} + 0.4 \times \frac{15 \times 16}{10^2} = 1.14,
\]
from which we calculate the prior variance as
\[
Var(\theta) = E(\theta^2) - E(\theta)^2 = 1.14 - 0.9^2 = 0.33
\]
and prior standard deviation as
\[
SD(\theta) = \sqrt{Var(\theta)} = \sqrt{0.33} = 0.574.
\]

We have already seen that combining a random sample of size 20 from an exponential
distribution with a Ga(g, h) prior distribution results in a Ga(g + 20, h + 20x̄) posterior

distribution. Therefore, the (overall) posterior distribution will be a mixture distribution


with component distributions

π1 (θ|x) = Ga(25, 10 + 20x̄) and π2 (θ|x) = Ga(35, 10 + 20x̄).

We now calculate new values for the weights p1∗ and p2∗ = 1 − p1∗ , which will depend on
both prior information and the data. We have
\[
p_1^* = \frac{0.6\, f_1(x)}{0.6\, f_1(x) + 0.4\, f_2(x)},
\]
from which
\[
(p_1^*)^{-1} - 1 = \frac{0.4\, f_2(x)}{0.6\, f_1(x)}.
\]
In general, the functions
\[
f_i(x) = \int_\Theta \pi_i(\theta)\, f(x|\theta)\, d\theta
\]
are potentially complicated integrals (solved either analytically or numerically). However, as with Candidate's formula, these calculations become much simpler when we have a conjugate prior distribution: rewriting Bayes Theorem, we obtain
\[
f(x) = \frac{\pi(\theta)\, f(x|\theta)}{\pi(\theta|x)},
\]
and so when the prior and posterior densities have a simple form (as they do when using a conjugate prior), it is straightforward to determine f(x) using algebra rather than having to use calculus.
In this example we know that the gamma distribution is the conjugate prior distribution:
using a random sample of size n with mean x̄ and a Ga(g, h) prior distribution gives a
Ga(g + n, h + nx̄) posterior distribution, and so
\[
\begin{aligned}
f(x) &= \frac{\pi(\theta)\, f(x|\theta)}{\pi(\theta|x)} \\
&= \frac{\dfrac{h^g \theta^{g-1} e^{-h\theta}}{\Gamma(g)} \times \theta^{n} e^{-n\bar{x}\theta}}{\dfrac{(h + n\bar{x})^{g+n}\, \theta^{g+n-1} e^{-(h+n\bar{x})\theta}}{\Gamma(g+n)}} \\
&= \frac{h^g\, \Gamma(g+n)}{\Gamma(g)\, (h + n\bar{x})^{g+n}}.
\end{aligned}
\]
Therefore
\[
\begin{aligned}
(p_1^*)^{-1} - 1 &= \frac{0.4 \times 10^{15}\, \Gamma(35)}{\Gamma(15)(10 + 20\bar{x})^{35}} \bigg/ \frac{0.6 \times 10^{5}\, \Gamma(25)}{\Gamma(5)(10 + 20\bar{x})^{25}} \\
&= \frac{2\, \Gamma(35)\, \Gamma(5)}{3\, \Gamma(25)\, \Gamma(15)\, (1 + 2\bar{x})^{10}} \\
&= \frac{611320}{7\, (1 + 2\bar{x})^{10}}
\end{aligned}
\]

and so
\[
p_1^* = \frac{1}{1 + \dfrac{611320}{7(1 + 2\bar{x})^{10}}}, \qquad p_2^* = 1 - p_1^*.
\]
Hence the posterior distribution is the mixture distribution
\[
p_1^* \times Ga(25,\, 10 + 20\bar{x}) + (1 - p_1^*) \times Ga(35,\, 10 + 20\bar{x}),
\]
with p_1^* as given above.

Recall that the most likely value of θ from the data alone, the likelihood mode, is 1/x̄.
Therefore, large values of x̄ indicate that θ is small and vice versa. With this in mind,
it is not surprising that the weight p1∗ (of the component distribution with the smallest
mean) is increasing in x̄, and p_1^* → 1 as x̄ → ∞. Using (1.12), the posterior mean is
\[
\begin{aligned}
E(\theta|x) &= p_1^* \times \frac{25}{10 + 20\bar{x}} + (1 - p_1^*) \times \frac{35}{10 + 20\bar{x}} \\
&= \frac{1}{2(1 + 2\bar{x})}\left(7 - \frac{2}{1 + \dfrac{611320}{7(1 + 2\bar{x})^{10}}}\right).
\end{aligned}
\]
The posterior standard deviation can be calculated using (1.12) and (1.13).
Table 1.6 shows the posterior distributions which result when various sample means x̄
are observed together with the posterior mean and the posterior standard deviation.
Graphs of these posterior distributions, together with the prior distribution, are given in
Figure 1.8. When considering the effect on beliefs of observing the sample mean x̄, it
is important to remember that large values of x̄ indicate that θ is small and vice versa.
Plots of the posterior mean against the sample mean reveal that the posterior mean lies
between the prior mean and the likelihood mode only for x̄ ∈ (0, 0.70) ∪ (1.12, ∞). Note
that observing the data has focussed our beliefs about θ in the sense that the posterior
standard deviation is less than the prior standard deviation – and considerably less in some
cases.

x̄     θ̂ = 1/x̄   Posterior mixture distribution                    E(θ|x)   SD(θ|x)
4      0.25       0.99997 Ga(25, 90) + 0.00003 Ga(35, 90)           0.278    0.056
2      0.5        0.9911 Ga(25, 50) + 0.0089 Ga(35, 50)             0.502    0.102
1.2    0.8        0.7027 Ga(25, 34) + 0.2973 Ga(35, 34)             0.823    0.206
1      1.0        0.4034 Ga(25, 30) + 0.5966 Ga(35, 30)             1.032    0.247
0.8    1.25       0.1392 Ga(25, 26) + 0.8608 Ga(35, 26)             1.293    0.260
0.5    2.0        0.0116 Ga(25, 20) + 0.9884 Ga(35, 20)             1.744    0.300

Table 1.6: Posterior distributions (with summaries) for various sample means x̄
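
Each row of Table 1.6 can be generated in a few lines of R. The sketch below (function name my own) computes the updated weight p_1^* and the posterior mean and standard deviation for a given sample mean x̄, using the moment formulae (1.12) and (1.13); the constant 611320/7 is the one derived above, so this function applies only to this example (n = 20, prior 0.6 Ga(5, 10) + 0.4 Ga(15, 10)).

# Posterior mixture weight and summaries for Example 1.13, given xbar
mixture.posterior <- function(xbar) {
  H <- 10 + 20 * xbar                                  # common posterior rate (n = 20)
  p1 <- 1 / (1 + 611320 / (7 * (1 + 2 * xbar)^10))     # updated weight on Ga(25, H)
  p <- c(p1, 1 - p1)
  means <- c(25, 35) / H
  vars <- c(25, 35) / H^2
  m1 <- sum(p * means)                                 # posterior mean, (1.12)
  m2 <- sum(p * (vars + means^2))                      # posterior E(theta^2), (1.13)
  c(p1 = p1, mean = m1, sd = sqrt(m2 - m1^2))
}
mixture.posterior(1.2)   # about (0.703, 0.823, 0.206), matching Table 1.6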

Figure 1.8: Plot of the prior distribution and various posterior distributions

1.6 Learning objectives

By the end of this chapter, you should be able to:

• determine the likelihood function using a random sample from any distribution

• combine this likelihood function with any prior distribution to obtain the posterior distribution

• name the posterior distribution if it is a “standard” distribution listed in these notes or on the exam paper; this list may well include distributions that are standard within the subject but which you have not met before. If the posterior distribution is not a “standard” distribution then it is okay just to give its density (or probability function) up to a constant.

• do all the above for a particular data set or for a general case with random sample x1, . . . , xn

• describe the different levels of prior information; determine and use conjugate priors and vague priors

• determine the asymptotic posterior distribution

• determine the predictive distribution, particularly, via Candidate's formula, when we have a random sample from any distribution and a conjugate prior

• describe and calculate confidence intervals, HDIs and prediction intervals

• determine posterior distributions when the prior is a mixture of conjugate distributions
