
Teacher’s Corner

A Default Bayesian Hypothesis Test for ANOVA Designs

Ruud WETZELS, Raoul P. P. P. GRASMAN, and Eric-Jan WAGENMAKERS

This article presents a Bayesian hypothesis test for analysis of variance (ANOVA) designs. The test is an application of standard Bayesian methods for variable selection in regression models. We illustrate the effect of various g-priors on the ANOVA hypothesis test. The Bayesian test for ANOVA designs is useful for empirical researchers and for students; both groups will get a more acute appreciation of Bayesian inference when they can apply it to practical statistical problems such as ANOVA. We illustrate the use of the test with two examples, and we provide R code that makes the test easy to use.

KEY WORDS: Bayes factor; Model selection; Teaching Bayesian statistics.

Ruud Wetzels is Applied Statistician, University of Amsterdam, Weesperplein 4, 1018 XA, Amsterdam (E-mail: [email protected]). Raoul P. P. P. Grasman is Applied Statistician, University of Amsterdam, Weesperplein 4, 1018 XA, Amsterdam (E-mail: [email protected]). Eric-Jan Wagenmakers is Cognitive Scientist, University of Amsterdam, Weesperplein 4, 1018 XA, Amsterdam (E-mail: [email protected]). This research was supported by Veni and Vidi grants from the Netherlands Organisation for Scientific Research (NWO).

1. INTRODUCTION

Bayesian methods have become increasingly popular in almost all scientific disciplines (e.g., Poirier 2006). One important reason for this gain in popularity is the ease with which Bayesian methods can be applied to relatively complex problems involving, for instance, hierarchical modeling or the comparison between nonnested models. However, Bayesian methods can also be applied in simpler statistical scenarios such as those that feature basic testing procedures. Prominent examples of such procedures include analysis of variance (ANOVA) and the t-test; these tests are the cornerstone of data analysis in fields such as biology, economics, sociology, and psychology.

Because Bayesian methods have become more mainstream in recent years, most technically oriented degree programs now offer at least one course on Bayesian inference at the graduate or undergraduate level. Our own experience in teaching one such course is that students often ask the same questions when Bayesian model selection and hypothesis testing are introduced. First, students are interested to know how they can apply Bayesian methods to testing problems that they face on a regular basis; second, students want to know how prior distributions can be chosen such that a test can be considered default. In this article we address both questions: we apply the Bayesian method to ANOVA designs and explain the rationale and impact of several default prior distributions.

Thus, the first goal of this article is to show how the Bayesian framework of hypothesis testing with the Bayes factor can be carried out in ANOVA designs. ANOVA is one of the most popular statistical methods to assess whether or not two or more population means are equal; in most experimental settings, ANOVA is used to test for the presence of a treatment effect. Because of its importance and simplicity, ANOVA is taught in virtually every applied statistics course. Nevertheless, the Bayesian hypothesis testing literature on ANOVA is scant; the dominant treatment of ANOVA is still classical or frequentist (e.g., Draper and Smith 1998; Faraway 2002) and, although the Bayesian treatment of ANOVA is gaining popularity (e.g., Gelman et al. 2004; Qian and Shen 2007; Ntzoufras 2009; Kaufman and Sain 2010), the latter has dealt almost exclusively with estimation, not testing (for exceptions, see Westfall and Gönen 1996; Sen and Churchill 2001; Ishwaran and Rao 2003; Ball 2005; Gönen et al. 2005; Maruyama 2009). This is all the more surprising because Bayesian hypothesis testing has been well developed for variable selection in regression models (e.g., Liang et al. 2008), of which ANOVA is a special case.

The second goal of this article is to describe the rationale behind a particular family of default priors, the g-priors, and to use these g-priors for default Bayesian tests for ANOVA designs. We hope this work shows students and experimental researchers how Bayesian hypothesis tests can be a valid and practical alternative to classical or frequentist tests.

The outline of this article is as follows. In the first section we briefly cover Bayesian estimation and Bayesian model selection. In the second section we describe the various g-priors that have been proposed in the literature on variable selection in regression models. Finally, we present two worked examples that show how the regression framework can be applied to one-way and two-way ANOVA designs.

2. BAYESIAN INFERENCE

2.1 Bayesian Estimation

In Bayesian estimation (e.g., Bernardo and Smith 1994; Lindley 2000; O'Hagan and Forster 2004), uncertainty about parameters is quantified by probability distributions.
Suppose we have a model M and we wish to estimate the model parameters θ. Then we have to define a prior distribution over these parameters, p(θ|M). When data Y come in, this prior distribution p(θ|M) is updated to yield the posterior distribution p(θ|Y, M) according to Bayes' rule:

p(θ|Y, M) = p(Y|θ, M) p(θ|M) / p(Y|M)
          = p(Y|θ, M) p(θ|M) / ∫ p(Y|θ, M) p(θ|M) dθ
          ∝ p(Y|θ, M) p(θ|M).

Hence, the posterior distribution of θ is proportional to the likelihood times the prior. In Bayesian parameter estimation, the researcher is interested in the posterior distribution of the model parameters, p(θ|Y, M). However, in Bayesian model selection the focus is on p(Y|M), the marginal likelihood of the data Y under model M.

2.2 Bayesian Model Selection

In Bayesian model selection, competing statistical models or hypotheses are assigned prior probabilities. Consider two competing models, M1 and M2, with prior probabilities p(M1) and p(M2). After observing the data, the relative plausibility of M1 and M2 is given by the ratio of posterior model probabilities, that is, the posterior odds:

p(M1|Y) / p(M2|Y) = [p(M1) / p(M2)] × [p(Y|M1) / p(Y|M2)].

Hence, the posterior odds are given by the product of the prior odds and the ratio of marginal likelihoods. The latter component is known as the Bayes factor (Jeffreys 1961; Dickey 1971; Berger and Sellke 1987; Kass and Raftery 1995) and quantifies the change from prior to posterior odds; therefore, the Bayes factor does not depend on the prior model probabilities p(M1) and p(M2) and quantifies the evidence that the data provide for M1 versus M2.
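To make this updating concrete, consider a minimal R sketch (the prior probabilities and the Bayes factor value below are hypothetical):

# Hypothetical input: equal prior model probabilities, Bayes factor of 3
prior.odds <- 0.5 / 0.5        # p(M1) / p(M2)
bf.12      <- 3                # p(Y|M1) / p(Y|M2)
post.odds  <- prior.odds * bf.12
post.odds / (1 + post.odds)    # posterior probability of M1: 0.75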
In linear regression and ANOVA, two models of special interest are the null model, MN, that does not include any of the predictors (but does include the intercept) and the full model, MF, that includes all relevant predictors. In this scenario, the main difficulty with the Bayes factor is its sensitivity to the prior distribution for the model parameters under test (Press et al. 2003; Berger 2006; Gelman 2008).

When there is limited knowledge about the phenomenon under study, the prior distribution for the parameters should be relatively uninformative. However, to avoid paradoxical results, the prior distribution cannot be too uninformative. In particular, the Jeffreys–Lindley–Bartlett paradox (Bartlett 1957; Jeffreys 1961; Lindley 1980; Shafer 1982; Berger and Delampady 1987; Robert 1993) shows that with vague uninformative priors on the parameters under test, the Bayes factor will strongly support the null model. The reason is that the marginal likelihood p(Y|M) is obtained by averaging the likelihood over the prior; when the prior is very spread out relative to the data, a large part of the prior distribution is associated with very low likelihoods, decreasing the average. This paradox is illustrated in Figure 3 and will be discussed later in the context of a specific model. The next section details how, in the context of linear regression and ANOVA, one can avoid the Jeffreys–Lindley–Bartlett paradox and nevertheless define prior distributions that are reasonably uninformative.

3. LINEAR REGRESSION, ANOVA, AND THE SPECIFICATION OF g-PRIORS

The prior distributions that we will discuss are applicable to model selection in the regression framework. Assume a response vector Y of length n, Y = (y1, ..., yn)ᵀ, normally distributed with mean vector µ = (µ1, ..., µn)ᵀ, precision φ, and In an n × n identity matrix,

Y ∼ N(µ, In/φ).

The mean µ can be decomposed into an overall common intercept α and the regression coefficients β. The mean µ then becomes

µ = 1n α + Xβ,

where X represents the n × k design matrix and β is the k-dimensional vector of regression coefficients.

In the ANOVA setting, the independent variables that are controlled in the experiment are called factors, which in turn can have different levels of intensity. Then, the regression coefficients are interpreted as level-specific parameters. The design matrix X is constructed using dummy coding (Draper and Smith 1998). Because the matrix [1n, X] does not necessarily have full column rank, we need to add a constraint. Here, we adopt the sum-to-zero constraint. By using this constraint, the intercept is the grand mean, and each regression coefficient describes the deviation from this grand mean; consequently, the regression coefficient of the last level equals minus the sum of the other regression coefficients.
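In R, this coding is available as the contr.sum contrast; a minimal sketch with a hypothetical three-level factor:

# Hypothetical factor with three levels and two observations per level
batch <- factor(rep(c("b1", "b2", "b3"), each = 2))
# Sum-to-zero coding: the last level is coded -1 on every dummy column,
# so its coefficient equals minus the sum of the other coefficients
X <- model.matrix(~ batch, contrasts.arg = list(batch = "contr.sum"))[, -1]
X            # two dummy columns for three levels
colSums(X)   # in this balanced design the columns are centered: 1n'X = 0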
In the one-way ANOVA, we examine the effect of a categorical variable X on the continuous response variable Y. The null hypothesis H0 states that all levels have the same mean; the alternative hypothesis H1 states that at least one of the levels has a different mean.

We can translate this frequentist test to a Bayesian model selection situation by comparing the model with all relevant regression coefficients to the model without these coefficients. In the remainder of this article we focus on the one-way and two-way ANOVA and show how these tests can be carried out in a Bayesian fashion.

The sections below list three default prior distributions. We focus on prior distributions for variable selection in regression, as this framework provides the basis for the analysis of ANOVA designs (for more information on Bayesian variable selection see Leamer 1978; Zellner 1986, 1987; Mitchell and Beauchamp 1988; Chipman 1996; George and McCulloch 1997). The following subsections detail, in historical order, three versions of the popular g-prior.


3.1 Zellner's g-Prior

In the case of linear regression, Zellner's g-prior (Zellner 1986) corresponds to

β | φ, g, X ∼ N(0, (g/φ)(XᵀX)⁻¹), g > 0,

with Jeffreys' prior (Jeffreys 1961) on the precision,

p(φ) ∝ 1/φ,

and a flat prior on the common intercept α. Note that we assume that the columns of X are centered so that 1nᵀX = 0.

This set of prior distributions is of the conjugate Normal-Gamma family, and therefore the marginal likelihood can be calculated analytically. When the design matrix is considered fixed, we are allowed to use it in our prior variance term as (g/φ)(XᵀX)⁻¹. Recall that the variance of the maximum likelihood estimator for β, var(β̂), equals φ⁻¹(XᵀX)⁻¹. Hence, the term g is a scaling factor for the prior: if we choose g to be 1, we give the prior the same weight as the sample; if we choose g to be 2, the prior is half as important as the sample; if we choose g to be n, the prior is 1/nth as important as the sample.

An obvious problem with this prior distribution is how to set the parameter g. If g is set low, then the prior distribution for β is relatively peaked and informative. If g is set high, then this prior is relatively spread out and uninformative. However, as described in the previous section, a prior that is too vague can result in the Jeffreys–Lindley–Bartlett paradox.

Various settings for g have been studied and proposed. A popular setting is g = n, corresponding to the so-called "unit information prior." The intuition is that this prior contains as much information as is present in a single observation (Kass and Wasserman 1995); the argument is that the precision of the sample estimate of β contains the information of n observations, so the amount of information in an imaginary single observation is this quantity divided by n, hence g = n. Another well-known choice of g is to set it equal to the square of the number of predictors of the regression model: g = k² (i.e., the Risk Inflation Criterion, Foster and George 1994). Furthermore, Fernandez, Ley, and Steel (2001) suggested taking g = max{n, k²} as a "benchmark prior."

A quantity of interest is the so-called shrinkage factor g/(g+1). It can be used to express the posterior mean of β, which is the least squares estimate of β multiplied by the shrinkage factor:

E[β | Y, X, M, g] = (g/(g+1)) β̂,

where β̂ is the least squares estimate of β. A low value of g pulls the posterior mean of β to zero, whereas a high value of g yields results similar to the least squares estimate. Note that, somewhat confusingly, a low shrinkage factor means more shrinkage and vice versa.

To compute the Bayes factor in the one-way ANOVA design, we compare the full model, MF, to the null model, MN. The Bayes factor is then given by

BF[MF : MN] = (1+g)^((n-k-1)/2) [1 + g(1-R²)]^(-(n-1)/2),   (1)

where k equals the number of predictors of MF, n is the sample size, and R² is the coefficient of determination of MF (note that R² for MN equals zero, as it contains no predictors).

Equation (1) shows that, in its general formulation, Zellner's g-prior is potentially vulnerable to the Jeffreys–Lindley–Bartlett paradox: when g → ∞ with n and k fixed, the Bayes factor BF[MF : MN] will go to 0, favoring the null model regardless of the observed data (see Figure 3 for an example).

Another problem with the Zellner g-prior is that, when the evidence in favor of the full model goes to infinity (i.e., R² goes to 1), the Bayes factor converges to the upper bound (1+g)^((n-k-1)/2). Liang et al. (2008) termed this undesirable property the "information paradox."
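Equation (1) can be evaluated in a single line of R. The sketch below uses values taken from the dyestuff example of Section 4 (n = 30 and k = 5; R² ≈ 0.49 follows from F(5, 24) = 4.60 via R² = kF/(kF + n - k - 1)) and shows both conventional choices of g and the paradox in action:

# Bayes factor of Equation (1) under Zellner's g-prior
bf.zellner <- function(g, n, k, r2) {
  (1 + g)^((n - k - 1)/2) * (1 + g * (1 - r2))^(-(n - 1)/2)
}
n  <- 30; k <- 5
r2 <- (k * 4.60) / (k * 4.60 + n - k - 1)   # about 0.49
bf.zellner(g = n,   n, k, r2)   # about 2.0, cf. Table 1 (g = n)
bf.zellner(g = k^2, n, k, r2)   # about 2.9, cf. Table 1 (g = k^2)
bf.zellner(g = 1e8, n, k, r2)   # vanishingly small: huge g favors the null model

Increasing g ever further drives the Bayes factor toward 0, exactly the behavior displayed in Figure 3.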
3.2 Jeffreys–Zellner–Siow (JZS) Prior

To test whether a parameter µ is zero or nonzero (with µ the mean of a normal distribution), Jeffreys (1961, pp. 268–270) suggested applying a Cauchy prior. The Cauchy prior was the simplest distribution to satisfy consistency requirements that Jeffreys considered important for hypothesis testing. One such requirement is that a researcher does not want to favor one model over another on the basis of a single datum.

Extending Jeffreys' suggestion to variable selection in the regression model, Zellner and Siow (1980) proposed a multivariate Cauchy prior on the regression coefficients and a flat prior on the common intercept. However, as the marginal likelihood is not analytically tractable, this approach did not gain much popularity.

Recently, however, Liang et al. (2008) represented the JZS prior as a mixture of g-priors, that is, an Inverse-Gamma(1/2, n/2) prior on g and Jeffreys' prior on the precision φ:

p(φ) ∝ 1/φ,
p(β | φ, X) = ∫ N(0, (g/φ)(XᵀX)⁻¹) p(g) dg,
p(g) = ((n/2)^(1/2) / Γ(1/2)) g^(-3/2) e^(-n/(2g)).

This formulation combines the computational advantages of the g-prior with the statistical advantages of the Cauchy prior. Note that again we assume that the columns of X are centered.

By assigning a prior to g, we avoid having to assign g a specific value; moreover, the prior on g allows us to estimate g from the data and obtain data-dependent shrinkage. Equation (2) gives the expected value of the shrinkage factor g/(g+1) under the JZS approach:

E[g/(g+1) | Y, M]
  = ∫_0^∞ (1+g)^((n-k-3)/2) [1+g(1-R²)]^(-(n-1)/2) g^(-1/2) e^(-n/(2g)) dg
  / ∫_0^∞ (1+g)^((n-k-1)/2) [1+g(1-R²)]^(-(n-1)/2) g^(-3/2) e^(-n/(2g)) dg.   (2)


It can be seen from Equation (2), and later from Equation (4), that the expected value of g/(g+1) increases with R² (Zeugner and Feldkircher 2009). Hence, there is less shrinkage when more variance is explained by the model.

In the JZS approach, the Bayes factor comparing the full model to the null model is

BF[MF : MN] = ((n/2)^(1/2) / Γ(1/2)) ∫_0^∞ (1+g)^((n-k-1)/2) [1+g(1-R²)]^(-(n-1)/2) g^(-3/2) e^(-n/(2g)) dg.   (3)

As pointed out by Liang et al. (2008), the integral is one-dimensional and easily approximated using standard software packages such as R (R Development Core Team 2004).

A drawback of the JZS prior is that the Bayes factor is not analytically available. However, the JZS prior is vulnerable neither to the Jeffreys–Lindley–Bartlett paradox nor to the information paradox (Liang et al. 2008).
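A minimal sketch of this one-dimensional numerical integration, again with the dyestuff values n = 30, k = 5, and R² ≈ 0.49 (the function name is ours; the Appendix gives a fuller version that also returns the shrinkage factor):

# JZS Bayes factor of Equation (3) via integrate()
bf.jzs <- function(n, k, r2) {
  integrand <- function(g) {
    (1 + g)^((n - k - 1)/2) * (1 + g * (1 - r2))^(-(n - 1)/2) *
      g^(-3/2) * exp(-n / (2 * g))
  }
  (n/2)^(1/2) / gamma(1/2) * integrate(integrand, 0, Inf)$value
}
bf.jzs(n = 30, k = 5, r2 = 0.49)   # about 3, cf. the JZS row of Table 1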
3.3 Hyper-g Priors

As an alternative to the JZS prior, Liang et al. (2008) proposed a family of prior distributions on g and termed this the hyper-g approach:

p(g) = ((a-2)/2) (1+g)^(-a/2), g > 0,

which is a proper distribution if a > 2 (Strawderman 1971; Cui and George 2008). Because this distribution leads to indeterminate Bayes factors when a ≤ 2, Liang et al. (2008) studied the behavior of this prior for 2 < a ≤ 4. Interestingly, this family of priors on g corresponds to the following prior on the shrinkage factor g/(1+g):

g/(1+g) ∼ Beta(1, a/2 - 1).

By choosing a, one can tune the prior on the shrinkage factor. When a = 4, the prior is uniform between 0 and 1, whereas when a is very close to 2, the prior distribution for the shrinkage factor will have most mass near 1. Figure 1 shows the effect of various a on the prior distribution for the shrinkage factor g/(g+1). Furthermore, Dellaportas, Forster, and Ntzoufras (in press) showed that the posterior densities of the parameters are, in terms of posterior shrinkage, insensitive to the choice of a within the recommended range. Only for very high values of a (in their simple linear regression example, a ≈ 20) was posterior shrinkage considerable.

Figure 1. Effect of parameter a on the shrinkage factor g/(g+1). When a = 4, the prior is uniform between 0 and 1, whereas when a is very close to 2, the prior distribution for the shrinkage factor has most mass near 1. Higher values for g/(g+1) result in less shrinkage.

The expected value of the shrinkage factor g/(g+1) under the hyper-g approach is

E[g/(g+1) | Y, M] = (2/(k+a)) × 2F1[(n-1)/2, 2; (k+a)/2 + 1; R²] / 2F1[(n-1)/2, 1; (k+a)/2; R²],   (4)

where 2F1(a, b; c; z) is the Gaussian hypergeometric function (Abramowitz and Stegun 1972),

2F1(a, b; c; z) = [Γ(c) / (Γ(c-b)Γ(b))] ∫_0^1 t^(b-1) (1-t)^(c-b-1) (1-tz)^(-a) dt, c > b > 0.

Just as with the JZS prior, the hyper-g approach estimates g and allows for data-dependent shrinkage.

In order to compare the two models that are important in the one-way ANOVA design, we need to calculate the Bayes factor (note that this Bayes factor is also available in closed form using the Gaussian hypergeometric function):

BF[MF : MN] = ((a-2)/2) ∫_0^∞ (1+g)^((n-k-1-a)/2) [1+g(1-R²)]^(-(n-1)/2) dg.   (5)

Just as with the JZS prior, the hyper-g approach is vulnerable neither to the Jeffreys–Lindley–Bartlett paradox nor to the information paradox (when a ≤ n - k + 1, Liang et al. 2008).
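Equation (5) admits the same one-dimensional numerical treatment; a sketch with the dyestuff values (again, the function name is ours):

# Hyper-g Bayes factor of Equation (5) via integrate()
bf.hyperg <- function(n, k, r2, a) {
  integrand <- function(g) {
    (1 + g)^((n - k - 1 - a)/2) * (1 + g * (1 - r2))^(-(n - 1)/2)
  }
  (a - 2)/2 * integrate(integrand, 0, Inf)$value
}
bf.hyperg(n = 30, k = 5, r2 = 0.49, a = 3)   # about 10, cf. Table 1
bf.hyperg(n = 30, k = 5, r2 = 0.49, a = 4)   # about 10, cf. Table 1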

4. A BAYESIAN ONE-WAY ANOVA

To illustrate the differences between the various priors and the effects they have on the Bayes factor for ANOVA designs, we first discuss the one-way ANOVA. We follow Box and Tiao (1973) and use example data from an experiment that was set up to investigate to what extent the yield of dyestuff differs between batches. The experiment featured six batches with five observations each. Figure 2 shows the box-and-whisker plot of the yield of dyestuff for the different batches. The left plot shows the original data from Box and Tiao (1973). To illustrate the behavior of the Bayes factor when the null hypothesis is true, the right plot shows the same data but with equal means (i.e., the difference between the batch mean and the overall mean was subtracted from the batch data).

Figure 2. Boxplots of yield of dyestuff per batch. The left plot (original data) shows the original data from Box and Tiao (1973). The right plot (modified data) shows the same data but with the difference between the batch mean and the overall mean subtracted from the batch data.

First, we carried out a classical one-way ANOVA to compute the F statistic and the corresponding p-value for both datasets. For the original dataset, we compute F(5, 24) = 4.60, p = 0.004, suggesting that at least one of the batches has a different yield. In the modified dataset with equal means, we compute F(5, 24) = 0, p = 1, suggesting that the yield of dyestuff is
equal for all batches, although such an inference in favor of the null hypothesis is not warranted in the Fisherian framework of p-value significance testing.

Next, we designed a Bayesian hypothesis test to contrast two models. The full model, MF, contains a grand mean α and the predictors for batches 1–5. The predictor for batch 6 is omitted because of the sum-to-zero constraint. The null model, MN, contains no predictors. Therefore, our test concerns the following two models:

MF: µ = 1n α + X1 β1 + X2 β2 + X3 β3 + X4 β4 + X5 β5,
MN: µ = 1n α.
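Using the functions from the Appendix, this comparison can be sketched as follows. The objects yield (the response vector of length 30) and batch (a six-level factor) are hypothetical stand-ins for the dyestuff data; note that the Appendix functions compute k as the number of columns of x minus one, so we pass the model matrix including its intercept column:

# Hypothetical objects: 'yield' (length 30) and 'batch' (factor, 6 levels)
x <- model.matrix(~ batch, contrasts.arg = list(batch = "contr.sum"))
zellner.g(yield, x, g = length(yield))   # unit information prior, g = n
zellner.g(yield, x, g = 5^2)             # Risk Inflation Criterion, g = k^2
zellnersiow(yield, x)                    # JZS
hyper.g(yield, x, a = 3)                 # hyper-g with a = 3

On the dyestuff data these calls should approximately reproduce the unequal-means columns of Table 1.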
The results from the Bayesian hypothesis test for the data with unequal group means, reported in Table 1, show that the two Zellner g-priors and the JZS prior yield only modest Bayes factor support in favor of MF; the two hyper-g priors yield more convincing support in favor of MF. Overall, the results suggest that the data may be too sparse to allow an unambiguous conclusion. Importantly, a Bayes factor of 3 arguably does not inspire as much confidence as one would glean from a p-value as low as 0.004 (Berger and Sellke 1987). This result highlights the general conflict between Bayes factors and p-values in terms of their evidential impact (e.g., Edwards, Lindman, and Savage 1963; Sellke, Bayarri, and Berger 2001).

Table 1. Bayes factors and shrinkage factors for the one-way ANOVA example on the dyestuff data (see Figure 2). The Bayes factor compares the full model to the null model, testing for a main effect of batch.

                         Unequal means                  Equal means
Prior              BF[F:N]   E[g/(g+1)|Y]       BF[F:N]        E[g/(g+1)|Y]
Zellner g = n        2.0        0.97            1.87 × 10^-4       0.97
Zellner g = k^2      2.9        0.96            2.90 × 10^-4       0.96
JZS                  3.1        0.90            8.51 × 10^-4       0.86
Hyper-g a = 3        9.9        0.71            0.17               0.25
Hyper-g a = 4       10.1        0.65            0.29               0.22

When the models are compared using the modified data, Table 1 shows that the two Zellner g-priors and the JZS prior yield considerable Bayes factor support in favor of the null model MN; the two hyper-g priors also provide evidence in favor of MN, albeit less extreme. Moreover, the relation between R² and the shrinkage factor now becomes clear: for each prior where g is estimated (i.e., JZS, hyper-g with a = 3, and hyper-g with a = 4), the shrinkage factor is lower when the null model is preferred, as is the case for the modified data.

Finally, we use the original dyestuff data with unequal means to illustrate the Jeffreys–Lindley–Bartlett paradox for the one-way ANOVA model. Under Zellner's g-prior with g = n or g = k², the Bayes factor was in favor of the full model. However, Figure 3 shows that by increasing g the Bayes factor can be made arbitrarily close to 0, indicating impressive evidence in favor of the null model.

Figure 3. An illustration of the Jeffreys–Lindley–Bartlett paradox when the Zellner g-prior is applied to the dyestuff data from Box and Tiao (1973). When g increases from 1 to 4, the Bayes factor in favor of the full model increases as well. By increasing g much further, the Bayes factor can be made arbitrarily close to 0, signifying infinite support for the null model.

5. A BAYESIAN TWO-WAY ANOVA

To illustrate the Bayesian two-way ANOVA, we use a slightly more complex example from Faraway (2002). As part of an investigation of toxic agents, a total of 48 rats were allocated to three poisons (I, II, III) and four treatments (A, B, C, D). The dependent variable is the reciprocal of the survival time in tens of hours, which can be interpreted as the rate of dying. Figure 4 shows the box-and-whisker plot of the survival times in the different experimental conditions.

Figure 4. Rate of dying per poison and per treatment. Poison group I (the reference level for poison) has a mean of 1.80; the means of groups II and III are 0.47 and 2.00 higher, respectively. Treatment group A (the reference level for treatment) has a mean of 3.52; the means of groups B, C, and D are 1.66, 0.57, and 1.36 lower, respectively.


First, we carried out a classical two-way ANOVA to compute the F statistics and the corresponding p-values. We first investigate whether the interaction terms should be incorporated in the model. We compute F(6, 36) = 1.1, p ≈ 0.39, suggesting that poison and treatment do not interact, although, again, such an inference in favor of the null hypothesis is not warranted in the Fisherian framework of p-value significance testing.

Because the interaction effects were not significant, we remove them from the model. Then, for the main effect of treatment, we compute F(3, 42) = 27.9, p < 0.001, suggesting that at least one of the treatments has an effect on the rate of dying. For the main effect of poison we compute F(2, 42) = 71.7, p < 0.001, suggesting that at least one of the poisons has an effect on the rate of dying.

Again, we compare the classical results to the Bayesian alternatives. We define the models needed to test for each of the main effects and for the interaction effects. To test for the effect of the interaction terms we define two models: the full model containing the main and interaction effects, MPT, and the same model without the interaction effects, MP+T. To test for the main effects we define the no-interaction model with the effects of treatment, MT; the no-interaction model with the effects of poison, MP; and the null model, MN:

MPT:  µ = 1n α + XI βI + XII βII + XA βA + XB βB + XC βC
           + XI×A βI×A + XI×B βI×B + XI×C βI×C
           + XII×A βII×A + XII×B βII×B + XII×C βII×C,
MP+T: µ = 1n α + XI βI + XII βII + XA βA + XB βB + XC βC,
MT:   µ = 1n α + XA βA + XB βB + XC βC,
MP:   µ = 1n α + XI βI + XII βII,
MN:   µ = 1n α.
same time illustrating various default prior distributions, such
We compare the reduced models to the larger model to test for the effect of the predictors that were left out. If the larger model is preferred over the reduced model, then the tested effects matter. However, these models cannot be compared directly using the methods outlined above, as these methods always feature the null model. Instead, we first calculate the Bayes factor comparing the larger model ML to the null model, BF[ML : MN], and the Bayes factor comparing the reduced model MR to the null model, BF[MR : MN]. The desired Bayes factor, BF[ML : MR], is then obtained by taking the ratio of Bayes factors:

BF[ML : MR] = BF[ML : MN] / BF[MR : MN].

We do not present the shrinkage factors because the model comparison is not between the null model and the full model but between two models with many predictors each.
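A sketch of this ratio approach for the interaction test, under, for example, the JZS prior; the objects rate, poison, and treat are hypothetical stand-ins for the rats data, and zellnersiow() is the Appendix function:

# Hypothetical objects: 'rate' (response), 'poison' and 'treat' (factors)
ctr   <- list(poison = "contr.sum", treat = "contr.sum")
x.pt  <- model.matrix(~ poison * treat, contrasts.arg = ctr)   # M_PT
x.p.t <- model.matrix(~ poison + treat, contrasts.arg = ctr)   # M_P+T
bf.pt.n  <- zellnersiow(rate, x.pt)[1]    # BF[M_PT : M_N]
bf.p.t.n <- zellnersiow(rate, x.p.t)[1]   # BF[M_P+T : M_N]
bf.pt.n / bf.p.t.n                        # BF[M_PT : M_P+T], cf. Table 2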
A test for the interaction involves the comparison between MPT and MP+T. Table 2 shows the results for the Bayes factors that test for the presence of the interaction terms. The different priors do not change the overall conclusion: all priors support the model without the interaction terms. Hence, we drop the interaction terms from the ANOVA model and proceed with the main effects only.

Table 2. Bayes factors for the two-way ANOVA for the rats dataset from Faraway (2002) plotted in Figure 4. The Bayes factors compare the relevant models to each other to test for the main effects of poison and treatment, and for their interaction.

Prior             BF[PT : P+T]    BF[P+T : P]    BF[P+T : T]
Zellner g = n     2.61 × 10^-4    6.87 × 10^7    3.09 × 10^12
Zellner g = k^2   1.45 × 10^-5    3.41 × 10^8    4.36 × 10^11
JZS               5.37 × 10^-4    4.52 × 10^7    1.24 × 10^12
Hyper-g a = 3     9.41 × 10^-4    2.95 × 10^7    1.81 × 10^11
Hyper-g a = 4     1.34 × 10^-3    2.07 × 10^7    6.72 × 10^10

By comparing MP+T to MP, we can test for a main effect of treatment. Table 2 shows that all Bayesian hypothesis tests favor the model that includes the treatment effect, regardless of the specific choice of prior distribution.

By comparing MP+T to MT, we can test for a main effect of poison. The Bayesian hypothesis tests show that all methods favor the model that includes the poison effect, regardless of the specific choice of prior distribution (see Table 2). The support for the model with a main effect of poison is considerably higher than the support for the main effect of treatment.

6. CONCLUSION

ANOVA is one of the most often-used statistical methods in the empirical sciences. However, Bayesian hypothesis tests are rarely conducted in ANOVA designs; instead, most theoretical development has concerned the more general problem of selecting variables in regression models (e.g., Mitchell and Beauchamp 1988; George and McCulloch 1997; Kuo and Mallick 1998; Casella and Moreno 2006; O'Hara and Sillanpää 2009). Here we showed how the regression framework can be seamlessly carried over to ANOVA designs, at the same time illustrating various default prior distributions, such as Zellner's g-prior, the JZS approach, and the hyper-g approach (for a similar approach see Bayarri and García-Donato 2007). Of course, other Bayesian model specifications for ANOVA are possible; ours has the advantage that it follows directly from the regression approach that has been studied in detail. A further didactical advantage is that many students are already familiar with linear regression, and the extension to ANOVA is conceptually straightforward. In addition, software programs implemented in R make it easy for students and teachers to apply Bayesian regression and ANOVA to inference problems of practical interest; moreover, this software allows users to compare the substantive Bayesian conclusions to those drawn from the classical p-value approach. In general, the software implementation of the theoretical framework provides students with the opportunity of considerable hands-on experience with Bayesian hypothesis testing, something that is likely to increase not only their understanding, but also their motivation to learn.

We feel it is important for students to realize that there is likely no single correct prior distribution; in fact, it can be informative to use different priors in a sensitivity analysis. If different plausible prior distributions lead to different substantive conclusions, it is best to acknowledge that the data are ambiguous.

Although not the focus of this article, post hoc comparisons can easily be accommodated within the present framework. For instance, one might be interested in testing which group
mean is different from the reference category mean. Then it is straightforward to calculate a Bayes factor to compare those means, using a procedure resembling a Bayesian t-test (Gönen et al. 2005). Another possibility is to apply model averaging and calculate an inclusion probability for each predictor over all possible models (Clyde 1999; Hoeting et al. 1999).

Note that, although the Bayes factor already has a dimension penalty built in (sometimes called the Bayesian Ockham's razor, Berger and Jefferys 1992), this is not a penalty against multiple comparisons. To correct for multiple comparisons, the prior on the model itself must be chosen appropriately (see Stephens and Balding 2009; Scott and Berger 2010, and references therein).

In sum, we have outlined a default Bayesian hypothesis test for ANOVA designs by a direct and simple extension of the framework for variable selection in regression models. In the course of doing so we have discussed three of the most popular default priors. We hope that empirical researchers and students can better appreciate and understand Bayesian hypothesis testing when they see how it can be applied to practical research problems for which ANOVA is often the method of choice.
APPENDIX: R FUNCTIONS TO COMPUTE THE ANOVA BAYES FACTOR

###
# For all functions:
# y is the response vector
# x is the design matrix; the functions take k = ncol(x) - 1,
#   consistent with x containing the intercept column
# R-scripts with the ANOVA examples can be found at
# www.ruudwetzels.com
###

# f21hyper(), the Gaussian hypergeometric function called in hyper.g(),
# is assumed here to come from the 'BMS' package (the article itself
# does not name its source)
library(BMS)

## (1) Function to compute the Bayes Factor
## with Zellner's g-prior

zellner.g = function(y, x, g){
  output = matrix(, 1, 2)
  colnames(output) = c('BF 10', 'g/(g+1)')
  n  = length(y)
  r2 = summary(lm(y ~ x))$r.squared
  k  = dim(x)[2] - 1
  # Bayes factor of Equation (1)
  output[1] = (1 + g)^((n - k - 1)/2) * (1 + g*(1 - r2))^(-(n - 1)/2)
  output[2] = g / (g + 1)
  return(output)}

## (2) Function to compute the Bayes Factor
## with the Jeffreys-Zellner-Siow prior

zellnersiow = function(y, x){
  output = matrix(, 1, 2)
  colnames(output) = c('BF 10', 'g/(g+1)')
  n  = length(y)
  r2 = summary(lm(y ~ x))$r.squared
  k  = dim(x)[2] - 1
  # integrand of Equation (3)
  BF.integral = function(g, n, k, r2){
    (1 + g)^((n - k - 1)/2) * (1 + g*(1 - r2))^(-(n - 1)/2) *
      g^(-3/2) * exp(-n/(2*g))}
  output[1] = ((n/2)^(1/2) / gamma(1/2)) *
    integrate(BF.integral, 0, Inf, n = n, k = k, r2 = r2)$value
  # numerator integrand of Equation (2)
  shrinkage.integral = function(g, n, k, r2){
    (1 + g)^((n - k - 3)/2) * (1 + g*(1 - r2))^(-(n - 1)/2) *
      g^(-1/2) * exp(-n/(2*g))}
  g. = integrate(shrinkage.integral, 0, Inf, n = n, k = k, r2 = r2)$value
  # Equation (2): expected shrinkage factor as a ratio of two integrals
  output[2] = g. / integrate(BF.integral, 0, Inf, n = n, k = k, r2 = r2)$value
  return(output)}

## (3) Function to compute the Bayes Factor
## with the Liang et al. hyper-g prior

hyper.g = function(y, x, a){
  output = matrix(, 1, 2)
  colnames(output) = c('BF 10', 'g/(g+1)')
  n  = length(y)
  r2 = summary(lm(y ~ x))$r.squared
  k  = dim(x)[2] - 1
  # integrand of Equation (5)
  BF.integral = function(g, n, k, a, r2){
    (1 + g)^((n - k - 1 - a)/2) * (1 + g*(1 - r2))^(-(n - 1)/2)}
  output[1] = ((a - 2)/2) *
    integrate(BF.integral, 0, Inf, n = n, k = k, a = a, r2 = r2)$value
  # Equation (4): expected shrinkage factor
  output[2] = (2/(k + a)) *
    (f21hyper((n - 1)/2, 2, (k + a)/2 + 1, r2) /
     f21hyper((n - 1)/2, 1, (k + a)/2, r2))
  return(output)}
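A quick self-contained check of the three functions on simulated data (all values below are hypothetical; the hyper.g() call relies on f21hyper(), assumed above to come from the 'BMS' package):

# Simulate a one-way design with three groups and a modest effect
set.seed(1)
group <- factor(rep(1:3, each = 10))
y     <- rnorm(30, mean = c(0, 0.5, 1)[group])
x     <- model.matrix(~ group, contrasts.arg = list(group = "contr.sum"))
zellner.g(y, x, g = length(y))   # unit information prior
zellnersiow(y, x)                # JZS
hyper.g(y, x, a = 3)             # hyper-g with a = 3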

[Received December 2010. Revised April 2012.]

REFERENCES


Abramowitz, M., and Stegun, I. (1972), Handbook of Mathematical Functions, New York: Dover.
Ball, R. (2005), "Experimental Designs for Reliable Detection of Linkage Disequilibrium in Unstructured Random Population Association Studies," Genetics, 170, 859–873.
Bartlett, M. S. (1957), "A Comment on D. V. Lindley's Statistical Paradox," Biometrika, 44, 533–534.
Bayarri, M. J., and García-Donato, G. (2007), "Extending Conventional Priors for Testing General Hypotheses in Linear Models," Biometrika, 94, 135–152.
Berger, J. (2006), "The Case for Objective Bayesian Analysis," Bayesian Analysis, 1, 385–402.
Berger, J. O., and Delampady, M. (1987), "Testing Precise Hypotheses," Statistical Science, 2, 317–352.
Berger, J. O., and Jefferys, W. H. (1992), "The Application of Robust Bayesian Analysis to Hypothesis Testing and Occam's Razor," Statistical Methods and Applications, 1, 17–32.
Berger, J. O., and Sellke, T. (1987), "Testing a Point Null Hypothesis: The Irreconcilability of p Values and Evidence," Journal of the American Statistical Association, 82, 112–139.
Bernardo, J. M., and Smith, A. F. M. (1994), Bayesian Theory, New York: Wiley.
Box, G. E. P., and Tiao, G. C. (1973), Bayesian Inference in Statistical Analysis, Reading, MA: Addison-Wesley.
Casella, G., and Moreno, E. (2006), "Objective Bayesian Variable Selection," Journal of the American Statistical Association, 101, 157–167.
Chipman, H. (1996), "Bayesian Variable Selection with Related Predictors," Canadian Journal of Statistics, 24, 17–36.
Clyde, M. (1999), "Bayesian Model Averaging and Model Search Strategies" (with discussion), Bayesian Statistics, 6, 157–185.
Cui, W., and George, E. (2008), "Empirical Bayes vs. Fully Bayes Variable Selection," Journal of Statistical Planning and Inference, 138, 888–900.
Dellaportas, P., Forster, J., and Ntzoufras, I. (in press), "Joint Specification of Model Space and Parameter Space Prior Distributions," Statistical Science.
Dickey, J. M. (1971), "The Weighted Likelihood Ratio, Linear Hypotheses on Normal Location Parameters," The Annals of Mathematical Statistics, 42, 204–223.
Draper, N., and Smith, H. (1998), Applied Regression Analysis, New York: Wiley-Interscience.
Edwards, W., Lindman, H., and Savage, L. J. (1963), "Bayesian Statistical Inference for Psychological Research," Psychological Review, 70, 193–242.
Faraway, J. (2002), "Practical Regression and ANOVA Using R," available at http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf.
Fernandez, C., Ley, E., and Steel, M. (2001), "Benchmark Priors for Bayesian Model Averaging," Journal of Econometrics, 100, 381–427.
Foster, D., and George, E. (1994), "The Risk Inflation Criterion for Multiple Regression," The Annals of Statistics, 22, 1947–1975.
Gelman, A. (2008), "Objections to Bayesian Statistics," Bayesian Analysis, 3, 445–450.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004), Bayesian Data Analysis (2nd ed.), Boca Raton, FL: Chapman & Hall/CRC.
George, E., and McCulloch, R. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.
Gönen, M., Johnson, W. O., Lu, Y., and Westfall, P. H. (2005), "The Bayesian Two-Sample t Test," The American Statistician, 59, 252–257.
Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999), "Bayesian Model Averaging: A Tutorial," Statistical Science, 14, 382–417.
Ishwaran, H., and Rao, J. (2003), "Detecting Differentially Expressed Genes in Microarrays Using Bayesian Model Selection," Journal of the American Statistical Association, 98, 438–455.
Jeffreys, H. (1961), Theory of Probability, Oxford, UK: Oxford University Press.
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 377–395.
Kass, R. E., and Wasserman, L. (1995), "A Reference Bayesian Test for Nested Hypotheses and Its Relationship to the Schwarz Criterion," Journal of the American Statistical Association, 90, 928–934.
Kaufman, C. G., and Sain, S. R. (2010), "Bayesian Functional ANOVA Modeling Using Gaussian Process Prior Distributions," Bayesian Analysis, 5, 123–150.
Kuo, L., and Mallick, B. (1998), "Variable Selection for Regression Models," Sankhya: The Indian Journal of Statistics, Series B, 60, 65–81.
Leamer, E. (1978), "Regression Selection Strategies and Revealed Priors," Journal of the American Statistical Association, 73, 580–587.
Liang, F., Paulo, R., Molina, G., Clyde, M., and Berger, J. (2008), "Mixtures of g Priors for Bayesian Variable Selection," Journal of the American Statistical Association, 103, 410–423.
Lindley, D. (1980), "L. J. Savage—His Work in Probability and Statistics," The Annals of Statistics, 8, 1–24.
Lindley, D. V. (2000), "The Philosophy of Statistics," The Statistician, 49, 293–337.
Maruyama, Y. (2009), "A Bayes Factor With Reasonable Model Selection Consistency for ANOVA Model," arXiv preprint arXiv:0906.4329.
Mitchell, T., and Beauchamp, J. (1988), "Bayesian Variable Selection in Linear Regression," Journal of the American Statistical Association, 83, 1023–1032.
Ntzoufras, I. (2009), Bayesian Modeling Using WinBUGS, Hoboken, NJ: Wiley.
O'Hagan, A., and Forster, J. (2004), Kendall's Advanced Theory of Statistics Vol. 2B: Bayesian Inference (2nd ed.), London: Arnold.
O'Hara, R., and Sillanpää, M. (2009), "A Review of Bayesian Variable Selection Methods: What, How and Which," Bayesian Analysis, 4, 85–118.
Poirier, D. J. (2006), "The Growth of Bayesian Methods in Statistics and Economics Since 1970," Bayesian Analysis, 1, 969–980.
Press, S., Chib, S., Clyde, M., Woodworth, G., and Zaslavsky, A. (2003), Subjective and Objective Bayesian Statistics: Principles, Models, and Applications, New York: Wiley-Interscience.
Qian, S., and Shen, Z. (2007), "Ecological Applications of Multilevel Analysis of Variance," Ecology, 88, 2489–2495.
R Development Core Team (2004), R: A Language and Environment for Statistical Computing, Vienna, Austria: R Foundation for Statistical Computing.
Robert, C. (1993), "A Note on Jeffreys–Lindley Paradox," Statistica Sinica, 3, 601–608.
Scott, J., and Berger, J. (2010), "Bayes and Empirical-Bayes Multiplicity Adjustment in the Variable-Selection Problem," The Annals of Statistics, 38, 2587–2619.
Sellke, T., Bayarri, M. J., and Berger, J. O. (2001), "Calibration of p Values for Testing Precise Null Hypotheses," The American Statistician, 55, 62–71.
Sen, S., and Churchill, G. (2001), "A Statistical Framework for Quantitative Trait Mapping," Genetics, 159, 371–387.
Shafer, G. (1982), "Lindley's Paradox," Journal of the American Statistical Association, 77, 325–351.
Stephens, M., and Balding, D. J. (2009), "Bayesian Statistical Methods for Genetic Association Studies," Nature Reviews Genetics, 10, 681–690.
Strawderman, W. (1971), "Proper Bayes Minimax Estimators of the Multivariate Normal Mean," The Annals of Mathematical Statistics, 42, 385–388.
Westfall, P., and Gönen, M. (1996), "Asymptotic Properties of ANOVA Bayes Factors," Communications in Statistics—Theory and Methods, 25, 3101–3123.
Zellner, A. (1986), "On Assessing Prior Distributions and Bayesian Regression Analysis With g-Prior Distributions," in Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, eds. P. K. Goel and A. Zellner, pp. 233–243.
Zellner, A. (1987), An Introduction to Bayesian Inference in Econometrics, Malabar, FL: R. E. Krieger.
Zellner, A., and Siow, A. (1980), "Posterior Odds Ratios for Selected Regression Hypotheses," in Bayesian Statistics, eds. J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith, Valencia: University Press, pp. 585–603.
Zeugner, S., and Feldkircher, M. (2009), "Benchmark Priors Revisited: On Adaptive Shrinkage and the Supermodel Effect in Bayesian Model Averaging," IMF Working Papers, 9, 202.
