hypothesis test. The Bayesian test for ANOVA designs is useful for empirical researchers and for students; both groups will get a more acute appreciation of Bayesian inference when they can apply it to practical statistical problems such as ANOVA. We illustrate the use of the test with two examples, and we provide R code that makes the test easy to use.

KEY WORDS: Bayes factor; Model selection; Teaching Bayesian statistics.

1. INTRODUCTION

Bayesian methods have become increasingly popular in almost all scientific disciplines (e.g., Poirier 2006). One important reason for this gain in popularity is the ease with which Bayesian methods can be applied to relatively complex problems involving, for instance, hierarchical modeling or the comparison between nonnested models. However, Bayesian methods can also be applied in simpler statistical scenarios such as those that feature basic testing procedures. Prominent examples of such procedures include analysis of variance (ANOVA) and the t-test; these tests are the cornerstone of data analysis in fields such as biology, economics, sociology, and psychology.

Because Bayesian methods have become more mainstream in recent years, most technically oriented studies now offer at least one course on Bayesian inference in their graduate or undergraduate program. Our own experience in teaching one such course is that students often ask the same questions when Bayesian model selection and hypothesis testing are introduced. First, students are interested to know how they can apply Bayesian methods to testing problems that they face on a regular basis.

Thus, the first goal of this article is to show how the Bayesian framework of hypothesis testing with the Bayes factor can be carried out in ANOVA designs. ANOVA is one of the most popular statistical methods to assess whether or not two or more population means are equal—in most experimental settings, ANOVA is used to test for the presence of a treatment effect. Because of its importance and simplicity, ANOVA is taught in virtually every applied statistics course. Nevertheless, the Bayesian hypothesis testing literature on ANOVA is scant; the dominant treatment of ANOVA is still classical or frequentist (e.g., Draper and Smith 1998; Faraway 2002) and, although the Bayesian treatment of ANOVA is gaining popularity (e.g., Gelman et al. 2004; Qian and Shen 2007; Ntzoufras 2009; Kaufman and Sain 2010), the latter has dealt almost exclusively with estimation, not testing (for exceptions, see Westfall and Gönen 1996; Sen and Churchill 2001; Ishwaran and Rao 2003; Ball 2005; Gönen et al. 2005; Maruyama 2009). This is all the more surprising because Bayesian hypothesis testing has been well developed for variable selection in regression models (e.g., Liang et al. 2008), of which ANOVA is a special case.

The second goal of this article is to describe the rationale behind a particular family of default priors—g-priors—and to use these g-priors for default Bayesian tests for ANOVA designs. We hope this work shows students and experimental researchers how Bayesian hypothesis tests can be a valid and practical alternative to classical or frequentist tests.

The outline of this article is as follows. In the first section we briefly cover Bayesian estimation and Bayesian model selection. In the second section we describe the various g-priors that have been proposed in the literature on variable selection in regression models. Finally, we present two worked examples that show how the regression framework can be applied to one-way and two-way ANOVA designs.
The variance of the least squares estimator for β, var(β̂), equals φ⁻¹(XᵀX)⁻¹. Hence, the term g is a scaling factor for the prior: if we choose g to be 1, we give the prior the same weight as the sample; if we choose g to be 2, the prior is half as important as the sample; if we choose g to be n, the prior is 1/nth as important as the sample.

An obvious problem with this prior distribution is how to set the parameter g. If g is set low, then the prior distribution for β is relatively peaked and informative. If g is set high, then the prior is relatively spread out and uninformative. However, as described in the previous section, a prior that is too vague can result in the Jeffreys–Lindley–Bartlett paradox.

Various settings for g have been studied and proposed. A popular setting is g = n, corresponding to the so-called "unit information prior." The intuition is that this prior contains as much information as present in a single observation (Kass and Wasserman 1995); the argument is that the precision of the sample estimate of β contains the information of n observations. Then the amount of information in an imaginary single observation is this quantity divided by n, hence g = n. Another well-known choice of g is to set it equal to the square of the number of predictors of the regression model: g = k² (i.e., the Risk Inflation Criterion, Foster and George 1994). Furthermore, Fernandez, Ley, and Steel (2001) suggested g = max{n, k²} as a "benchmark prior."

A quantity of interest is the so-called shrinkage factor g/(g + 1). It can be used to estimate the posterior mean of β, which is the least squares estimate of β multiplied by the shrinkage factor

E[β | Y, X, M, g] = g/(g + 1) · β̂,

where β̂ is the least squares estimate of β. A low value of g pulls the posterior mean of β to zero, whereas a high value of g yields results similar to the least squares estimate. Note that, somewhat confusingly, a low shrinkage factor means more shrinkage and vice versa.

To compute the Bayes factor in the one-way ANOVA design, we compare the full model, M_F, to the null model, M_N. Then, the Bayes factor is given by

BF[M_F : M_N] = (1 + g)^{(n−k−1)/2} [1 + g(1 − R²)]^{−(n−1)/2},

where R² is the coefficient of determination of the full regression model.
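To make this expression concrete, the following R sketch evaluates the g-prior Bayes factor for the three fixed choices of g discussed above. The values of n, k, and R² are hypothetical and serve only to illustrate the computation.

## Bayes factor of the full against the null model under Zellner's g-prior,
## using the closed-form expression above
bf.zellner = function(n, k, r2, g){
  (1 + g)^((n - k - 1)/2) * (1 + g*(1 - r2))^(-(n - 1)/2)}

n = 30; k = 5; r2 = 0.4                 # hypothetical values
bf.zellner(n, k, r2, g = n)             # unit information prior, g = n
bf.zellner(n, k, r2, g = k^2)           # Risk Inflation Criterion, g = k^2
bf.zellner(n, k, r2, g = max(n, k^2))   # benchmark prior, g = max{n, k^2}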
Jeffreys (1961) suggested applying a Cauchy prior. The Cauchy prior was the simplest distribution to satisfy consistency requirements that Jeffreys considered important for hypothesis testing. One such requirement is that a researcher does not want to favor one model over another on the basis of a single datum.

Extending Jeffreys' suggestion to variable selection in the regression model, Zellner and Siow (1980) proposed a multivariate Cauchy prior on the regression coefficients and a flat prior on the common intercept. However, as the marginal likelihood is not analytically tractable, this approach did not gain much popularity.

Recently, however, Liang et al. (2008) represented the JZS (Jeffreys–Zellner–Siow) prior as a mixture of g-priors, that is, an Inverse-Gamma(1/2, n/2) prior on g and Jeffreys' prior on the precision φ:

p(φ) ∝ 1/φ,
p(β | φ, g, X) ∝ ∫ N(0, (g/φ)(XᵀX)⁻¹) p(g) dg,
p(g) = (n/2)^{1/2} / Γ(1/2) · g^{−3/2} e^{−n/(2g)}.

This formulation combines the computational advantages of the g-prior with the statistical advantages of the Cauchy prior. Note that again we assume that the columns of X are centered.

By assigning a prior to g, we avoid having to assign g a specific value; moreover, the prior on g allows us to estimate g from the data and obtain data-dependent shrinkage. Equation (2) gives the expected value of the shrinkage factor g/(g + 1) with the JZS approach:

E[g/(g + 1) | Y, M] =
  ∫₀^∞ (1 + g)^{(n−k−3)/2} [1 + g(1 − R²)]^{−(n−1)/2} g^{−1/2} e^{−n/(2g)} dg
  / ∫₀^∞ (1 + g)^{(n−k−1)/2} [1 + g(1 − R²)]^{−(n−1)/2} g^{−3/2} e^{−n/(2g)} dg.   (2)

The JZS Bayes factor that compares the full model to the null model is

BF[M_F : M_N] = (n/2)^{1/2} / Γ(1/2) · ∫₀^∞ (1 + g)^{(n−k−1)/2} [1 + g(1 − R²)]^{−(n−1)/2} g^{−3/2} e^{−n/(2g)} dg.   (3)

As pointed out by Liang et al. (2008), the integral is one-dimensional and easily approximated using standard software packages such as R (R Development Core Team 2004). A drawback of the JZS prior is that the Bayes factor is not analytically available. However, the JZS prior is not vulnerable to the Jeffreys–Lindley–Bartlett paradox.
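Because both (2) and (3) involve only a one-dimensional integral, they can be approximated directly with R's integrate function. The sketch below does this for hypothetical values of n, k, and R²; it mirrors the zellnersiow function given in the Appendix.

## JZS Bayes factor (Equation (3)) and expected shrinkage factor (Equation (2))
## via one-dimensional numerical integration; n, k, and r2 are hypothetical
n = 30; k = 5; r2 = 0.4

jzs.integrand = function(g){
  (1 + g)^((n - k - 1)/2) * (1 + g*(1 - r2))^(-(n - 1)/2) *
    g^(-3/2) * exp(-n/(2*g))}

denom = integrate(jzs.integrand, 0, Inf)$value
bf.jzs = sqrt(n/2)/gamma(1/2) * denom                          # Equation (3)
shrink.jzs = integrate(function(g) g/(1 + g) * jzs.integrand(g),
                       0, Inf)$value / denom                   # Equation (2)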
Figure 1. Effect of parameter a on the shrinkage factor g/(g + 1). When a = 4, the prior is uniform between 0 and 1, whereas when a is very close to 2, the prior distribution for the shrinkage factor has most mass near 1. Higher values for g/(g + 1) result in less shrinkage.
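Figure 1 refers to the hyper-g prior of Liang et al. (2008), p(g) = ((a − 2)/2)(1 + g)^{−a/2}, which is also the prior underlying the hyper.g function in the Appendix. Under this prior the shrinkage factor g/(g + 1) follows a Beta(1, a/2 − 1) distribution, so its density can be drawn directly; the short sketch below reproduces the qualitative pattern described in the caption.

## Prior density of the shrinkage factor t = g/(g+1) implied by the hyper-g
## prior p(g) = ((a-2)/2)*(1+g)^(-a/2): t ~ Beta(1, a/2 - 1). For a = 4 the
## density is uniform on (0, 1); as a approaches 2, the mass moves toward 1.
shrink.dens = function(t, a){ (a/2 - 1) * (1 - t)^(a/2 - 2) }

t = seq(0, 0.99, length.out = 500)
plot(t, shrink.dens(t, a = 2.1), type = "l", ylim = c(0, 4),
     xlab = "g/(g+1)", ylab = "Density")
lines(t, shrink.dens(t, a = 3), lty = 2)
lines(t, shrink.dens(t, a = 4), lty = 3)   # a = 4: uniform prior on the shrinkage factor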
For the modified data with equal group means, a classical analysis suggests that the means are equal for all batches, although such an inference in favor of the null hypothesis is not warranted in the Fisherian framework of p-value significance testing.

Next, we designed a Bayesian hypothesis test to contrast two models. The full model, M_F, contains a grand mean α and the predictors for batches 1–5. The predictor for batch 6 is omitted because of the sum-to-zero constraint. The null model, M_N, contains no predictors. Therefore, our test concerns the following two models:

M_F: µ = 1_n α + X_1 β_1 + X_2 β_2 + X_3 β_3 + X_4 β_4 + X_5 β_5,
M_N: µ = 1_n α.
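In R, these two models correspond to a design matrix with an intercept and five sum-to-zero batch predictors versus an intercept-only matrix. The sketch below builds such a matrix and passes it to the functions from the Appendix; the data frame dye is simulated here as a stand-in for the dyestuff data (six batches of five observations), so the resulting numbers are purely illustrative.

## One-way ANOVA Bayes factors with the Appendix functions. `dye` is a
## simulated stand-in for the dyestuff data: a six-level factor `batch`
## and a numeric response `yield`.
set.seed(1)
dye = data.frame(batch = gl(6, 5), yield = rnorm(30, mean = 1500, sd = 50))

## Intercept plus five sum-to-zero batch predictors (batch 6 omitted)
x = model.matrix(~ batch, data = dye,
                 contrasts.arg = list(batch = "contr.sum"))
y = dye$yield

zellner.g(y, x, g = length(y))   # Zellner's g-prior with g = n
zellnersiow(y, x)                # JZS prior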
The results from the Bayesian hypothesis test for the data with unequal group means, reported in Table 1, show that the two Zellner g-priors and the JZS prior yield only modest Bayes factor support in favor of M_F; the two hyper-g priors yield more convincing support in favor of M_F: overall, the results suggest that the data may be too sparse to allow an unambiguous conclusion. Importantly, a Bayes factor of 3 arguably does not inspire as much confidence as one would glean from a p-value as low as 0.004 (Berger and Sellke 1987). This result highlights the general conflict between Bayes factors and p-values in terms of their evidential impact (e.g., Edwards, Lindman, and Savage 1963; Sellke, Bayarri, and Berger 2001).

When the models are compared using the modified data, Table 1 shows that the two Zellner g-priors and the JZS prior yield considerable Bayes factor support in favor of the null model M_N; the two hyper-g priors also provide evidence in favor of M_N, albeit less extreme. Moreover, the relation between R² and the shrinkage factor now becomes clear: for each prior where g is estimated (i.e., JZS, hyper-g with a = 3, and hyper-g with a = 4), the shrinkage factor is lower when the null model is preferred, as is the case for the modified data.

Table 1. Bayes factors and shrinkage factors for the one-way ANOVA example on the dyestuff data, see Figure 2. The Bayes factor compares the full model to the null model, testing for a main effect of batch.

                     Unequal group means          Modified data (equal means)
Prior                BF[M_F : M_N]   g/(g+1)      BF[M_F : M_N]   g/(g+1)
Zellner g = n        2.0             0.97         1.87 × 10⁻⁴     0.97
Zellner g = k²       2.9             0.96         2.90 × 10⁻⁴     0.96
JZS                  3.1             0.90         8.51 × 10⁻⁴     0.86
Hyper-g a = 3        9.9             0.71         0.17            0.25
Hyper-g a = 4        10.1            0.65         0.29            0.22

Finally, we use the original dyestuff data with unequal means to illustrate the Jeffreys–Lindley–Bartlett paradox for the one-way ANOVA model. Under Zellner's g-prior with g = n or g = k², the Bayes factor was in favor of the full model. However, Figure 3 shows that by increasing g the Bayes factor can be made arbitrarily close to 0, indicating impressive evidence in favor of the null model.
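This behavior can be reproduced with the closed-form g-prior Bayes factor sketched earlier: holding n, k, and R² fixed at hypothetical values, the Bayes factor favors the full model for moderate values of g but can be driven arbitrarily close to zero by increasing g.

## Jeffreys-Lindley-Bartlett behavior of the g-prior Bayes factor:
## for fixed (hypothetical) n, k, and R^2, BF -> 0 as g -> infinity
n = 30; k = 5; r2 = 0.6
bf.zellner = function(g){ (1 + g)^((n - k - 1)/2) * (1 + g*(1 - r2))^(-(n - 1)/2) }

g.grid = 10^seq(0, 8, length.out = 200)
plot(g.grid, bf.zellner(g.grid), log = "xy", type = "l",
     xlab = "g", ylab = "BF[MF : MN]")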
5. A BAYESIAN TWO-WAY ANOVA

To illustrate the Bayesian two-way ANOVA, we use a slightly more complex example from Faraway (2002). As part of an investigation of toxic agents, a total of 48 rats were allocated to three poisons (I, II, III) and four treatments (A, B, C, D). The dependent variable is the reciprocal of the survival time in tens of hours, which can be interpreted as the rate of dying. Figure 4 shows the box-and-whisker plot of the survival times in the different experimental conditions.

Figure 4. Rate of dying per poison and per treatment. Poison group I (the reference level for poison) has a mean of 1.80; the means of groups II and III are 0.47 and 2.00 higher, respectively. Treatment group A (the reference level for treatment) has a mean of 3.52; the means of groups B, C, and D are 1.66, 0.57, and 1.36 lower, respectively.
To test for the effect of the interaction terms we define two models: the full model containing the main and interaction effects, M_PT, and the same model without the interaction effects, M_P+T. To test for the main effects we define the no-interaction model with the effects of treatment, M_T; the no-interaction model with the effects of poison, M_P; and the null model, M_N.

M_PT: µ = 1_n α + X_I β_I + X_II β_II + X_A β_A + X_B β_B + X_C β_C
            + X_I×A β_I×A + X_I×B β_I×B + X_I×C β_I×C
            + X_II×A β_II×A + X_II×B β_II×B + X_II×C β_II×C,
M_P+T: µ = 1_n α + X_I β_I + X_II β_II + X_A β_A + X_B β_B + X_C β_C,
M_T: µ = 1_n α + X_A β_A + X_B β_B + X_C β_C,
M_P: µ = 1_n α + X_I β_I + X_II β_II,
M_N: µ = 1_n α.
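The corresponding design matrices can be constructed in R with sum-to-zero contrasts, so that the third poison level and the fourth treatment level are omitted, as in the model equations above. The data frame rats is simulated here as a stand-in for the rats data of Faraway (2002); only the structure (three poisons by four treatments, four rats per cell) matters for the sketch.

## Sum-to-zero coded design matrices for the two-way example. `rats` is a
## simulated stand-in for the 48 rats (4 per poison-by-treatment cell).
set.seed(1)
rats = expand.grid(poison = gl(3, 1, labels = c("I", "II", "III")),
                   treatment = gl(4, 1, labels = c("A", "B", "C", "D")))
rats = rats[rep(1:12, each = 4), ]
rats$rate = rnorm(48, mean = 2.5, sd = 0.5)   # hypothetical rate of dying

ctr = list(poison = "contr.sum", treatment = "contr.sum")
x.PT  = model.matrix(~ poison * treatment, data = rats, contrasts.arg = ctr)      # M_PT
x.PpT = model.matrix(~ poison + treatment, data = rats, contrasts.arg = ctr)      # M_P+T
x.T   = model.matrix(~ treatment, data = rats, contrasts.arg = ctr["treatment"])  # M_T
x.P   = model.matrix(~ poison, data = rats, contrasts.arg = ctr["poison"])        # M_P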
We compare the reduced models to the larger model to test for the effect of the predictors that were left out. If the larger model is preferred over the reduced model then the tested effects matter. However, these models cannot be compared directly using the methods outlined above, as these methods always feature the null model. Instead, we first calculate the Bayes factor comparing the larger model M_L to the null model, BF[M_L : M_N], and the reduced model M_R to the null model, BF[M_R : M_N]. The desired Bayes factor, BF[M_L : M_R], is then obtained by taking the ratio of Bayes factors

BF[M_L : M_R] = BF[M_L : M_N] / BF[M_R : M_N].

We do not present the shrinkage factors because the model comparison is not between the null model and the full model but between two models with many predictors each.
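In code, this amounts to dividing the Bayes factors returned by the Appendix functions for the two designs. The helper below uses zellnersiow and the design matrices from the previous sketch; the call shown performs the interaction test, that is, M_PT against M_P+T.

## Bayes factor of a larger against a reduced model, obtained as the ratio of
## their Bayes factors against the null model (JZS prior, zellnersiow from
## the Appendix). Both design matrices include an intercept column.
bf.ratio = function(y, x.larger, x.reduced){
  zellnersiow(y, x.larger)[1] / zellnersiow(y, x.reduced)[1]}

## Interaction test: compare M_PT with M_P+T (design matrices and simulated
## `rats` data from the previous sketch)
bf.ratio(rats$rate, x.PT, x.PpT)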
A test for the interaction involves the comparison between M_PT and M_P+T. Table 2 shows the results for the Bayes factors that test for the presence of the interaction terms. The different priors do not change the overall conclusion: all priors support the model without the interaction terms. Hence, we drop the interaction terms from the ANOVA model and proceed with the main effects only.

By comparing M_P+T to M_P, we can test for a main effect of treatment. All Bayesian hypothesis tests indicate this effect, regardless of the specific choice of prior distribution (see Table 2). The support for the model with a main effect of poison is considerably higher than the support for the main effect of treatment.

6. CONCLUSION

ANOVA is one of the most often-used statistical methods in the empirical sciences. However, Bayesian hypothesis tests are rarely conducted in ANOVA designs; instead, most theoretical development has concerned the more general problem of selecting variables in regression models (e.g., Mitchell and Beauchamp 1988; George and McCulloch 1997; Kuo and Mallick 1998; Casella and Moreno 2006; O'Hara and Sillanpää 2009). Here we showed how the regression framework can be seamlessly carried over to ANOVA designs, at the same time illustrating various default prior distributions, such as Zellner's g-prior, the JZS approach, and the hyper-g approach (for a similar approach see Bayarri and García-Donato 2007). Of course, other Bayesian model specifications for ANOVA are possible; ours has the advantage that it follows directly from the regression approach that has been studied in detail. A further didactical advantage is that many students are already familiar with linear regression and the extension to ANOVA is conceptually straightforward. In addition, software programs implemented in R make it easy for students and teachers to apply Bayesian regression and ANOVA to inference problems of practical interest; this software also allows users to compare the substantive Bayesian conclusions to those drawn from the classical p-value approach. In general, the software implementation of the theoretical framework provides students with the opportunity of considerable hands-on experience with Bayesian hypothesis testing, something that is likely to increase not only their understanding, but also their motivation to learn. We feel it is important for students to realize that there is likely no single correct prior distribution; in fact, it can be informative to use different priors in a sensitivity analysis. If different plausible prior distributions lead to different substantive conclusions, it is best to acknowledge that the data are ambiguous.

Although not the focus of this article, post hoc comparisons can easily be accommodated within the present framework. For instance, one might be interested in testing which group means differ from one another.
We hope that empirical researchers and students can better appreciate and understand Bayesian hypothesis testing when they see how it can be applied to practical research problems for which ANOVA is often the method of choice.

APPENDIX: R FUNCTIONS TO COMPUTE THE ANOVA BAYES FACTOR

###
# For all functions:
# y is the response vector
# x is the design matrix
# R-scripts with the ANOVA examples can be found at
# www.ruudwetzels.com
###

## (1) Function to compute the Bayes Factor
## with Zellner's g-prior

zellner.g = function(y, x, g){
  output = matrix(NA, 1, 2)
  colnames(output) = c('BF 10', 'g/(g+1)')
  n = length(y)
  r2 = summary( lm(y ~ x) )$r.squared
  k = dim(x)[2] - 1
  output[1] = (1+g)^((n-k-1)/2) * (1+g*(1-r2))^(-(n-1)/2)
  output[2] = g/(g+1)
  return(output)}

## (2) Function to compute the Bayes Factor
## with Jeffreys-Zellner-Siow prior

zellnersiow = function(y, x){
  output = matrix(NA, 1, 2)
  colnames(output) = c('BF 10', 'g/(g+1)')
  n = length(y)
  r2 = summary( lm(y ~ x) )$r.squared
  k = dim(x)[2] - 1

  BF.integral = function(g, n = n, k = k, r2 = r2){
    (1+g)^((n-k-1)/2) * (1+g*(1-r2))^(-(n-1)/2) * g^(-3/2) * exp(-n/(2*g))}

  ## the denominator integral is reused for the Bayes factor (Equation (3))
  ## and the expected shrinkage factor (Equation (2))
  denom = integrate(BF.integral, 0, Inf, n=n, k=k, r2=r2)$value
  output[1] = sqrt(n/2)/gamma(1/2) * denom
  output[2] = integrate(function(g) g/(1+g) * BF.integral(g, n=n, k=k, r2=r2),
                        0, Inf)$value / denom
  return(output)}

## (3) Function to compute the Bayes Factor
## with Liang et al. hyper-g prior
## (f21hyper is the Gaussian hypergeometric function 2F1 from the BMS package)

hyper.g = function(y, x, a){
  output = matrix(NA, 1, 2)
  colnames(output) = c('BF 10', 'g/(g+1)')
  n = length(y)
  r2 = summary( lm(y ~ x) )$r.squared
  k = dim(x)[2] - 1

  BF.integral = function(g, n = n, k = k, a = a, r2 = r2){
    (1+g)^((n-k-1-a)/2) * (1+g*(1-r2))^(-(n-1)/2)}

  output[1] = ((a-2)/2) * integrate(BF.integral, 0, Inf, n=n, a=a, k=k, r2=r2)$value
  output[2] = (2/(k+a)) * (f21hyper((n-1)/2, 2, (k+a)/2+1, r2) /
                           f21hyper((n-1)/2, 1, (k+a)/2, r2))
  return(output)}
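The following lines illustrate how the three functions can be called on a simulated one-way data set (six groups of five observations, chosen arbitrarily) as a quick sensitivity analysis across the different default priors; hyper.g additionally requires the f21hyper function from the BMS package.

## Example usage on simulated one-way data (hypothetical values)
# library(BMS)   # provides f21hyper(), needed by hyper.g
set.seed(123)
group = gl(6, 5)
y = rnorm(30) + rep(c(0, 0, 0, 0.5, 0.5, 1), each = 5)
x = model.matrix(~ group, contrasts.arg = list(group = "contr.sum"))

zellner.g(y, x, g = length(y))   # Zellner's g-prior, g = n
zellner.g(y, x, g = 5^2)         # Zellner's g-prior, g = k^2
zellnersiow(y, x)                # JZS prior
# hyper.g(y, x, a = 3)           # hyper-g prior (uncomment once BMS is loaded)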
[Received December 2010. Revised April 2012.]

REFERENCES

Abramowitz, M., and Stegun, I. (1972), Handbook of Mathematical Functions, New York: Dover.

Ball, R. (2005), "Experimental Designs for Reliable Detection of Linkage Disequilibrium in Unstructured Random Population Association Studies," Genetics, 170, 859–873.

Bartlett, M. S. (1957), "A Comment on D. V. Lindley's Statistical Paradox," Biometrika, 44, 533–534.

Bayarri, M. J., and García-Donato, G. (2007), "Extending Conventional Priors for Testing General Hypotheses in Linear Models," Biometrika, 94, 135–152.

Berger, J. (2006), "The Case for Objective Bayesian Analysis," Bayesian Analysis, 1, 385–402.

Berger, J. O., and Delampady, M. (1987), "Testing Precise Hypotheses," Statistical Science, 2, 317–352.

Berger, J. O., and Jefferys, W. H. (1992), "The Application of Robust Bayesian Analysis to Hypothesis Testing and Occam's Razor," Statistical Methods and Applications, 1, 17–32.

Dickey, J. M. (1971), "The Weighted Likelihood Ratio, Linear Hypotheses on Normal Location Parameters," The Annals of Mathematical Statistics, 42, 204–223.

Draper, N., and Smith, H. (1998), Applied Regression Analysis, New York: Wiley-Interscience.

Edwards, W., Lindman, H., and Savage, L. J. (1963), "Bayesian Statistical Inference for Psychological Research," Psychological Review, 70, 193–242.

Faraway, J. (2002), "Practical Regression and ANOVA Using R," available at http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf.

Fernandez, C., Ley, E., and Steel, M. (2001), "Benchmark Priors for Bayesian Model Averaging," Journal of Econometrics, 100, 381–427.

Foster, D., and George, E. (1994), "The Risk Inflation Criterion for Multiple Regression," The Annals of Statistics, 22, 1947–1975.

Gelman, A. (2008), "Objections to Bayesian Statistics," Bayesian Analysis, 3, 445–450.

Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004), Bayesian Data Analysis (2nd ed.), Boca Raton, FL: Chapman & Hall/CRC.

George, E., and McCulloch, R. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.

Gönen, M., Johnson, W. O., Lu, Y., and Westfall, P. H. (2005), "The Bayesian Two-Sample t Test," The American Statistician, 59, 252–257.

Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999), "Bayesian Model Averaging: A Tutorial," Statistical Science, 14, 382–417.

Ishwaran, H., and Rao, J. (2003), "Detecting Differentially Expressed Genes in Microarrays Using Bayesian Model Selection," Journal of the American Statistical Association, 98, 438–455.

Jeffreys, H. (1961), Theory of Probability, Oxford, UK: Oxford University Press.

Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 377–395.

Kass, R. E., and Wasserman, L. (1995), "A Reference Bayesian Test for Nested Hypotheses and Its Relationship to the Schwarz Criterion," Journal of the American Statistical Association, 90, 928–934.

Kaufman, C. G., and Sain, S. R. (2010), "Bayesian Functional ANOVA Modeling Using Gaussian Process Prior Distributions," Bayesian Analysis, 5, 123–150.

Kuo, L., and Mallick, B. (1998), "Variable Selection for Regression Models," Sankhya: The Indian Journal of Statistics, Series B, 60, 65–81.

Leamer, E. (1978), "Regression Selection Strategies and Revealed Priors," Journal of the American Statistical Association, 73, 580–587.

Poirier, D. J. (2006), "The Growth of Bayesian Methods in Statistics and Economics Since 1970," Bayesian Analysis, 1, 969–980.

Press, S., Chib, S., Clyde, M., Woodworth, G., and Zaslavsky, A. (2003), Subjective and Objective Bayesian Statistics: Principles, Models, and Applications, New York: Wiley-Interscience.

Qian, S., and Shen, Z. (2007), "Ecological Applications of Multilevel Analysis of Variance," Ecology, 88, 2489–2495.

R Development Core Team (2004), R: A Language and Environment for Statistical Computing, Vienna, Austria: R Foundation for Statistical Computing.

Robert, C. (1993), "A Note on Jeffreys–Lindley Paradox," Statistica Sinica, 3, 601–608.

Scott, J., and Berger, J. (2010), "Bayes and Empirical-Bayes Multiplicity Adjustment in the Variable-Selection Problem," The Annals of Statistics, 38, 2587–2619.

Sellke, T., Bayarri, M. J., and Berger, J. O. (2001), "Calibration of p Values for Testing Precise Null Hypotheses," The American Statistician, 55, 62–71.

Sen, S., and Churchill, G. (2001), "A Statistical Framework for Quantitative Trait Mapping," Genetics, 159, 371–387.

Shafer, G. (1982), "Lindley's Paradox," Journal of the American Statistical Association, 77, 325–351.

Stephens, M., and Balding, D. J. (2009), "Bayesian Statistical Methods for Genetic Association Studies," Nature Reviews Genetics, 10, 681–690.

Strawderman, W. (1971), "Proper Bayes Minimax Estimators of the Multivariate Normal Mean," The Annals of Mathematical Statistics, 42, 385–388.

Westfall, P., and Gönen, M. (1996), "Asymptotic Properties of ANOVA Bayes Factors," Communications in Statistics - Theory and Methods, 25, 3101–3123.

Zellner, A. (1986), "On Assessing Prior Distributions and Bayesian Regression Analysis With g-Prior Distributions," in Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, eds. P. K. Goel and A. Zellner, pp. 233–243.

Zellner, A. (1987), An Introduction to Bayesian Inference in Econometrics, Malabar, FL: R. E. Krieger.

Zellner, A., and Siow, A. (1980), "Posterior Odds Ratios for Selected Regression Hypotheses," in Bayesian Statistics, eds. J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith, Valencia: University Press, pp. 585–603.

Zeugner, S., and Feldkircher, M. (2009), "Benchmark Priors Revisited: On Adaptive Shrinkage and the Supermodel Effect in Bayesian Model Averaging," IMF Working Papers, 9, 202.