
Variational Bayesian hierarchical regression for data analysis

Dennis Becker*
Department of Information Systems, Leuphana University

November 12, 2018

arXiv:1811.03687v1 [stat.AP] 8 Nov 2018 (paper draft)

Abstract
Collected data, which are used for analysis or prediction tasks, often have a hierarchical structure,
for example, data from various people performing the same task. Modeling the data's structure can
improve the reliability of the derived results and the prediction performance on newly observed data.
Bayesian modeling provides a tool-kit for designing hierarchical models. However, Markov Chain
Monte Carlo methods, which are commonly used for parameter estimation, are computationally
expensive, which often renders them impractical for many applications. Variational Bayesian
methods, in contrast, allow deriving an approximation with much less computational effort. This
document describes the derivation of a variational approximation for a hierarchical linear Bayesian
regression and demonstrates its application to data analysis.

1 Introduction
Bayesian methods allow developing models that describe the data generation process, deriving
confidence bounds on the estimated parameters, and making predictions for new observations. The
Bayesian inference of the model parameters, however, especially for hierarchical models, quickly
becomes intractable. Yet the analyzed data often have a hierarchical structure. For example, in
social science, data can be observed from different people that perform the same task. Therefore,
it appears natural to model such a structure to derive an estimate for each subject, provide
individual predictions of future observations, and draw conclusions about the general population.
For the estimation of such models, usually Markov Chain Monte Carlo (MCMC) methods
are used, which approximate the model by sampling from a Markov chain whose stationary
distribution is the posterior distribution [6, 4]. Although these methods provide guarantees about
the samples that are taken from the targeted density [11], they are computationally expensive
even for small data sets.
In contrast, variational inference can be suited for larger data sets and scenarios where the model,
or a variety of models, has to be estimated more quickly. It only provides an approximation
of the posterior distribution, however, does not provide the same guarantees as MCMC methods, and
underestimates the variance of the posterior distribution [7].
In the following, we will describe the derivation of a variational Bayesian hierarchical regression
model. The model is closely related to the Bayesian linear regression model described
by Drugowitsch (2013) and Bishop (2006), to which we add a hierarchical prior distribution.
After deriving the model, we will demonstrate its use on a freely available data set [5]. An
implementation of the algorithm in R is available at https://github.com/dennisthemenace2/hBReg/.
* [email protected]

2 Variational Inference
Probabilistic models describe how the observed variables X = x_1, ..., x_n have been generated
under the influence of a number of latent variables Z. Under these latent variables, we summarize
all parameters that are used to model the relations in the observed data. The resulting probabilistic
model is specified by the joint density p(X, Z). To derive an estimate of the latent variables given
the observed data, we aim to compute p(Z|X), which is called the posterior distribution. Using
Bayes' rule, the posterior distribution is stated as:
    p(Z|X) = p(X, Z) / p(X) = p(X|Z) p(Z) / p(X) = p(X|Z) p(Z) / ∫ p(X, Z) dZ.
The denominator is called the marginal likelihood or model evidence. Typically, the calculation
of the evidence is not available in closed form; therefore, inference in such models is often based
on Markov Chain Monte Carlo methods [6, 4]. Alternatives are approximate methods such as
variational inference [1, 7].
Variational inference aims to approximate the posterior distribution p with a distribution q from
a set of tractable distributions Q. It therefore states the inference problem as an optimization
problem, which allows estimating an approximate solution q(Z) ≈ p(Z|X).
To solve this optimization problem, we need a measure of the similarity between q and p. The
most common choice in variational Bayes for describing the difference between the two distributions
is the Kullback-Leibler (KL) divergence [8], which measures the difference in the information
contained within the two distributions:

    KL(q||p) = Σ_x q(x) log ( q(x) / p(x) ).    (1)

The divergence between the two distributions is always greater than or equal to zero, KL(q||p) ≥ 0 for all
q, p, and equal to zero only if q = p. Next, we plug the posterior distribution p(Z|X) into the KL
divergence:

    KL(q(Z)||p(Z|X)) = Σ_Z q(Z) log ( q(Z) / p(Z|X) )
                     = Σ_Z q(Z) log ( q(Z) / p(X, Z) ) + log p(X)
                     = KL(q(Z)||p(X, Z)) + log p(X)
Here, we pulled the normalizing constant, or marginal probability p(X), out of the fraction and
recognize the remaining term as the KL divergence to the unnormalized posterior distribution. Since
the log model evidence log p(X) is constant with respect to the variational distribution q(Z),
minimizing the unnormalized KL divergence minimizes the KL divergence defined above. By
rearranging terms, we get

    log p(X) = KL(q(Z)||p(Z|X)) − KL(q(Z)||p(X, Z)).

We notice that the log model evidence log p(X) is equal to the difference between the normalized
and unnormalized KL divergence. Furthermore, because KL(q(Z)||p(Z|X)) ≥ 0, the negative of the
unnormalized KL divergence is a lower bound on the model evidence. Due to this property, the negative
KL(q(Z)||p(X, Z)) is called the variational lower bound or the evidence lower bound (ELBO).
Minimizing KL(q(Z)||p(X, Z)) therefore amounts to maximizing a lower bound on the model evidence.
To complete the specification of the optimization problem, we need to describe the variational
family Q. A widely used class of distributions is the mean-field approximation, which assumes an
independent factorization:
    q(Z) = ∏_{m=1}^{M} q_m(Z_m).

Each latent variable Z_m is governed by its own variational factor q_m(Z_m), which renders
the individual factors mutually independent. For the mean-field choice of Q, we can optimize the
problem using coordinate descent. We iterate over the variational factors q_m(Z_m), and for each m
we optimize the evidence lower bound over q_m while keeping the other variational factors q_j(Z_j)
constant. This results in the log of the optimal solution for each factor:

    ln q*_m(Z_m) = E_{j≠m}[ ln p(X, Z) ] + const.    (2)

This procedure iteratively fits the fully factored approximation q(Z) = q_1(Z_1) q_2(Z_2) ⋯ q_M(Z_M)
of p(Z|X).
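To make the procedure concrete, the following is a minimal sketch of such a coordinate ascent loop in R. The names 'updates' (a list of factor-update functions) and 'elbo' are placeholders for model-specific functions and are not part of the derivation above.

cavi <- function(q, updates, elbo, tol = 1e-6, max_iter = 100) {
  bound_old <- -Inf
  for (iter in seq_len(max_iter)) {
    for (update in updates) {   # cycle over the variational factors q_m
      q <- update(q)            # optimize one factor while the others are held fixed
    }
    bound <- elbo(q)            # evidence lower bound after a full sweep
    if (abs(bound - bound_old) < tol) break
    bound_old <- bound
  }
  q
}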

2.1 Hierarchical Variational Bayesian Regression


An often encountered scenario is that the analyzed data have a hierarchical structure, for example,
people undergoing the same type of treatment. We can then consider the data from each client as
an independent sample from the same population, which naturally suggests a hierarchical structure
of the data. Typically, the observed data consist of a target variable y_i ∈ R and independent
variables x_i ∈ R^D for each client. If we assume a linear relationship between them, the suggested
model is a hierarchical linear regression.
We further assume that we want the benefits of a Bayesian model and that it has to be
estimated quickly; therefore, we derive a hierarchical Bayesian linear regression model.
The derived model is very similar to the Bayesian linear regression models described by Drugowitsch
(2013) and Bishop (2006). In addition to these models, we place a hierarchical prior on the
individual people's or clients' weights. The complete model is shown in plate notation in Figure 1.

Figure 1: Plate notation of the hierarchical linear regression model

The model is shown with all its components, which makes it appear quite complex, but this allows
us to discuss the design of the probabilistic graphical model and the reasoning behind it.
Filled nodes are observed or known, circles typically represent distributions, and boxes
represent constants. The rectangles with names in them are plates, which indicate that their elements
are observed multiple times. We start from the bottom of the illustration with the observed data
Y. Regarding the nature of the target variable, we assume a continuous variable for the regression
model. This suggests a normal distribution N(y | µ, σ^{-1}), which has two parameters, the mean
µ and the precision (inverse variance) σ. Both parameters have to be specified by either a distribution
or a constant value, which is why these models naturally appear to grow in one direction. Next,
we discuss the plates surrounding the data. These illustrate that we have multiple clients or
subjects from which we observe repeated measures or observations. We specify the precision
parameter using its conjugate prior, a Gamma distribution Gamma(σ | a_0, b_0). The
Gamma distribution has two parameters that we need to specify; we set both to small values,
which represents an uninformative prior. Placing this prior on the precision leads to an estimation
of the noise from the data. We continue with the µ parameter of the normal distribution of the
target variable. Based on the assumed linear relationship between the observed data and the target
variable, the mean is the product of the observed data x_{i,c} and a client-specific weight vector β_i,
which is itself given the conjugate prior of the mean, a normal distribution. This completes the
description of the observed data, and we state the factors:
    p(Y | x, β, σ) = ∏_{i=1}^{C} ∏_{c=1}^{M} N(y_{i,c} | x_{i,c} β_i, σ^{-1}),
    p(σ) = Gamma(σ | a_0, b_0).

In an iterative fashion, we continue describing the client-individual parameters β_i. The prior on
the mean of the client-individual parameters, denoted ∆, represents the population
of which all individuals are samples. The precision s describes the influence of the population, or
hierarchical prior, on the individual. We describe the client-specific factors as:
    p(β | ∆, s) = ∏_{i=1}^{C} N(β_i | ∆, s^{-1}),
    p(s) = Gamma(s | c_0, d_0).

For the prior on the mean of the hierarchical prior ∆, we use the constant 0, which encourages
the weights to become small. It shrinks the weights towards 0, similar to a quadratic regularisation
term in ridge regression [2]. For the prior on the precision, we choose a Gamma distribution for
each observed variable (or dimension) individually. The reasoning is that during optimization of
the model, irrelevant coefficients will shrink automatically. This process is known as automatic
relevance determination (ARD) [9, 13, 14]. This concludes the model construction with the final
factors:

    p(∆ | 0, w) = N(∆ | 0, w^{-1}),
    p(w) = ∏_{d=1}^{D} Gamma(w_d | e_0, f_0).

In order to continue with the model development and variational approximation, we state the
model's joint probability:

    p(Y, x, β, σ, ∆, s, w) = p(Y | xβ, σ) p(β | ∆, s) p(∆ | 0, w) p(σ) p(s) p(w),

and assume that the posterior distribution is approximated by the factored variational posterior
distributions,

p(βi , ∆, σ, s, w | D) ≈ q(βi )q(∆)q(σ)q(s)q(w).

Now, we have to derive the update equations for the variational factors using the coordinate
descent algorithm by applying Equation 2.
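As a sanity check for an implementation, the generative process described above can be simulated. The following R sketch draws synthetic data from the model; the dimensions and hyperparameter values are illustrative assumptions, not values used in the paper.

set.seed(1)
C <- 20; M <- 30; D <- 3              # clients, observations per client, dimensions
w     <- rgamma(D, 2, 2)              # per-dimension precision of the population prior
Delta <- rnorm(D, 0, sqrt(1 / w))     # population-level weights
s     <- rgamma(1, 2, 2)              # precision tying the clients to the population
sigma <- rgamma(1, 2, 2)              # noise precision
X    <- lapply(1:C, function(i) matrix(rnorm(M * D), M, D))       # one design matrix per client
beta <- t(sapply(1:C, function(i) rnorm(D, Delta, sqrt(1 / s))))  # client-specific weights (C x D)
Y    <- lapply(1:C, function(i) as.vector(X[[i]] %*% beta[i, ]) + rnorm(M, 0, sqrt(1 / sigma)))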

2.2 Update factor q(βi )


To derive the update equation, we select the factors that depend on β_i, since the others are
constant and get absorbed into a constant term. We replace the factors with their actual distributions
and multiply everything out to separate the terms that do not depend on β_i into the
constant term. The terms are rearranged and grouped until we recognize a known distribution.
Since we used the conjugate prior and take the product of two normal distributions, the
result will also be a normal distribution.

    ln q*_{β_i}(β_i) = E_{∆,σ,s}[ ln p(y | x β_i, σ) + ln p(β_i | ∆, s) ] + const
      = E_σ[ Σ_{c=1}^{M} ( (1/2) ln σ − (σ/2)(y_c − x_c β_i)^2 ) ] + E_{∆,s}[ (1/2) ln s − (s/2)(∆ − β_i)^2 ] + const
      = E_σ[ −(σ/2) Σ_{c=1}^{M} (y_c − x_c β_i)^2 ] + E_{∆,s}[ −(s/2)(∆ − β_i)^2 ] + const
      = E_σ[ −(σ/2) Σ_{c=1}^{M} y_c^2 + σ β_i Σ_{c=1}^{M} y_c x_c − (σ/2) β_i^2 Σ_{c=1}^{M} x_c^2 ] + E_{∆,s}[ −(s/2) ∆^2 + s ∆ β_i − (s/2) β_i^2 ] + const
      = E_σ[ σ β_i Σ_{c=1}^{M} y_c x_c − (σ/2) β_i^2 Σ_{c=1}^{M} x_c^2 ] + E_{∆,s}[ s ∆ β_i − (s/2) β_i^2 ] + const
      = ( E_σ[σ] Σ_{c=1}^{M} x_c y_c + E_s[s] E_∆[∆] ) β_i − (1/2) ( E_σ[σ] Σ_{c=1}^{M} x_c^2 + E_s[s] ) β_i^2 + const,

where the first bracket corresponds to the mean and the second to the variance of the resulting
normal distribution.

By completing the square over β_i, we can derive the parameters of the Gaussian distribution.
There remain the expectations of the variables with respect to their variational distributions,
which we consider to be constant and have to replace. The variational distribution of the
terms E_σ[σ] and E_s[s] is a Gamma distribution, and the expected value of a Gamma distribution
is E[Gamma(a, b)] = a/b. Therefore, the expected values are E_σ[σ] = a_n/b_n and E_s[s] = c_n/d_n, respectively.
The expectation of ∆ is with respect to a normal distribution, whose expected value is its mean,
which results in E_∆[∆] = ∆. This leads to the following update equations for β_i:
following update equations for βi :

    λ_{β_i} = (a_n / b_n) Σ_{c=1}^{M} x_c^2 + c_n / d_n,
    β_i = ( (a_n / b_n) Σ_{c=1}^{M} x_c y_c + (c_n / d_n) ∆ ) / λ_{β_i}.

The update equations show that the client-individual weights depend on the client's individual
data and on the hierarchical prior.
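In the multivariate case, the same update can be written with a design matrix per client. The following R sketch is one way to implement it, assuming X_i (M x D), y_i (length M), the current hierarchical mean Delta, and the current Gamma parameters; Lambda_beta below stores the posterior covariance, i.e. the inverse of the precision appearing in the update.

update_beta <- function(X_i, y_i, Delta, a_n, b_n, c_n, d_n) {
  D <- ncol(X_i)
  E_sigma <- a_n / b_n                          # E[sigma] under its Gamma factor
  E_s     <- c_n / d_n                          # E[s] under its Gamma factor
  precision   <- E_sigma * crossprod(X_i) + E_s * diag(D)
  Lambda_beta <- solve(precision)               # posterior covariance of beta_i
  beta_i <- Lambda_beta %*% (E_sigma * crossprod(X_i, y_i) + E_s * Delta)
  list(mean = as.vector(beta_i), cov = Lambda_beta)
}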

5
2.3 Update factor q(∆)
Next, we derive the update equation for the factor q(∆), the variational posterior distribution
of the hierarchical mean. The procedure is the same as before: we select the terms that depend on ∆, because
the other terms get absorbed into the constant term. We multiply everything out and rearrange
until we arrive at the same form as for the previous factor, where we recognize the terms of the
resulting normal distribution.


    ln q*_∆(∆) = E_{β,s}[ ln p(β | ∆, s) + ln p(∆ | µ_0, w) ] + const
      = E_{β,s}[ Σ_{i=1}^{C} ( (1/2) ln s − (s/2)(∆ − β_i)^2 ) ] + E_w[ (1/2) ln w − (w/2)(∆ − µ_0)^2 ] + const
      = E_{β,s}[ −(s/2) Σ_{i=1}^{C} (∆ − β_i)^2 ] + E_w[ −(w/2)(∆ − µ_0)^2 ] + const
      = E_{β,s}[ −(C s / 2) ∆^2 + s ∆ Σ_{i=1}^{C} β_i ] + E_w[ w µ_0 ∆ − (w/2) ∆^2 ] + const
      = E_{s,w}[ −(1/2)(C s + w) ∆^2 ] + E_{β,s,w}[ ∆ ( s Σ_{i=1}^{C} β_i + w µ_0 ) ] + const

Similar to before, we have to fill in the expected values E_w[w] = e_n/f_n and E_{β_i}[β_i] = β_i;
the expected value of s has already been stated in the update of the previous factor. This leads
to the following update equations for ∆:

    λ_∆ = C (c_n / d_n) + e_n / f_n,
    ∆ = ( (c_n / d_n) Σ_{i=1}^{C} β_i + (e_n / f_n) µ_0 ) / λ_∆.
Note that f_n is a vector with D entries, whereas e_n is a single value: each weight has its own
Gamma prior, but the parameter e_n is the same for all dimensions if they share the same prior
value. We will see this in the derivation of the update for w. The update equations show that the
hierarchical prior resembles a mean weight, consisting of the sum of all clients' weights scaled by
the individual Gamma distributions of w.
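A minimal R sketch of this update, assuming beta is a C x D matrix of current client weight means, mu0 the prior mean (0 in this paper), e_n a scalar, and f_n a length-D vector; the division by f_n is elementwise, so each dimension obtains its own precision.

update_delta <- function(beta, mu0, c_n, d_n, e_n, f_n) {
  C <- nrow(beta)
  E_s <- c_n / d_n                       # E[s]
  E_w <- e_n / f_n                       # E[w_d], one value per dimension
  lambda_delta <- C * E_s + E_w          # per-dimension precision of q(Delta)
  Delta <- (E_s * colSums(beta) + E_w * mu0) / lambda_delta
  list(mean = Delta, var = 1 / lambda_delta)   # variances, i.e. the diagonal of the covariance
}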

2.4 Update factor q(σ)


Now, we derive the update equation for the precision of the noise σ. We proceed as in the previous
updates and since we use a conjugate prior, we expect a Gamma distribution as the result.

    ln q*_σ(σ) = E_β[ ln p(Y | xβ, σ) + ln p(σ | a, b) ] + const
      = E_β[ Σ_{i=1}^{C} Σ_{c=1}^{M} ( (1/2) ln σ − (σ/2)(y_{i,c} − x_{i,c} β_i)^2 ) ] + (a − 1) ln σ − b σ + const
      = E_β[ −(σ/2) Σ_{i=1}^{C} Σ_{c=1}^{M} (y_{i,c} − x_{i,c} β_i)^2 ] + ( (a − 1) + (1/2) Σ_{i=1}^{C} Σ_{c=1}^{M} 1 ) ln σ − b σ + const
      = E_β[ −(σ/2) Σ_{i=1}^{C} Σ_{c=1}^{M} ( y_{i,c}^2 − 2 y_{i,c} x_{i,c} β_i + x_{i,c}^2 β_i^2 ) ] + ( (a − 1) + (1/2) Σ_{i=1}^{C} Σ_{c=1}^{M} 1 ) ln σ − b σ + const
      = −( (1/2) E_β[ Σ_{i=1}^{C} Σ_{c=1}^{M} ( x_{i,c}^2 β_i^2 − 2 y_{i,c} x_{i,c} β_i + y_{i,c}^2 ) ] + b ) σ + ( (a − 1) + (1/2) Σ_{i=1}^{C} Σ_{c=1}^{M} 1 ) ln σ + const,

where the term multiplying σ yields the rate parameter b and the term multiplying ln σ the shape
parameter a of the resulting Gamma distribution.

We recognize the result as a log Gamma distribution. This time, we need the expected value of
β_i^2; note that E[X^2] = E[X]^2 + Var(X), which leads to E_{β_i}[β_i^2] = β_i^2 + λ_{β_i}
for the expectation of β_i^2. If we substitute the expectations and rearrange the terms, we receive
the following update equations:

    N = Σ_{i=1}^{C} Σ_{c=1}^{M} 1,
    a_n = a + N/2,
    b_n = b + (1/2) ( Σ_{i=1}^{C} Σ_{c=1}^{M} (y_{i,c} − x_{i,c} β_i)^2 + Σ_{i=1}^{C} x_i^T λ_{β_i} x_i ).

Considering the update equations, we notice that they represent the noise in measuring the target:
the first term is the prediction error and the second term is the sum of the standard errors. If the
fitting error on the data is high compared to the number of samples, the expected value a_n/b_n
of the precision will be low, so the variance is high. The opposite also holds: if we make no
prediction error but have high uncertainty in the estimated clients' weights, we expect the
measurements to be less reliable.
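A minimal R sketch of this update, assuming X and Y are lists with one design matrix and one response vector per client, beta is the C x D matrix of weight means, and cov_beta is the list of corresponding covariance matrices.

update_sigma <- function(X, Y, beta, cov_beta, a0, b0) {
  C <- length(X)
  N <- sum(sapply(Y, length))     # total number of observations
  resid  <- sum(sapply(1:C, function(i) sum((Y[[i]] - X[[i]] %*% beta[i, ])^2)))
  uncert <- sum(sapply(1:C, function(i) sum(diag(X[[i]] %*% cov_beta[[i]] %*% t(X[[i]])))))
  a_n <- a0 + N / 2
  b_n <- b0 + 0.5 * (resid + uncert)
  list(a_n = a_n, b_n = b_n)      # E[sigma] = a_n / b_n
}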

2.5 Update factor q(s)


The result for the factor q(s), which regulates the influence of the hierarchical prior on the
client-individual weights, will also be a Gamma distribution.

7
    ln q*_s(s) = E_{β,∆}[ ln p(β | ∆, s) + ln p(s | c, d) ] + const
      = E_{β,∆}[ Σ_{i=1}^{C} Σ_{d=1}^{D} ( (1/2) ln s − (s/2)(β_{i,d} − ∆_d)^2 ) ] + (c − 1) ln s − d s + const
      = E_{β,∆}[ −(s/2) Σ_{i=1}^{C} Σ_{d=1}^{D} (β_{i,d} − ∆_d)^2 ] + ( (c − 1) + (1/2) Σ_{i=1}^{C} Σ_{d=1}^{D} 1 ) ln s − d s + const
      = −( (1/2) E_{β,∆}[ Σ_{i=1}^{C} Σ_{d=1}^{D} ( β_{i,d}^2 − 2 β_{i,d} ∆_d + ∆_d^2 ) ] + d ) s + ( (c − 1) + CD/2 ) ln s + const

The result contains the squared expected values of β and ∆. The expected value of ∆^2 is derived
in the same way as the one of β^2 in the previous derivation, which results in E_∆[∆^2] = ∆^2 + λ_∆.
After substituting and rearranging the terms, we derive the update equations:

    c_n = c + CD/2,
    d_n = d + (1/2) ( Σ_{i=1}^{C} Σ_{d=1}^{D} (β_{i,d} − ∆_d)^2 + tr(λ_∆) + Σ_{i=1}^{C} tr(λ_{β_i}) ).

The update equations show that the variance depends on the squared differences between the
individual weights and the hierarchical prior, the variance of the estimated hierarchical prior, and
the variance of the estimated client weights. If the difference between the hierarchical prior and
the clients' weights is large, the hierarchical prior will have less influence on the individual clients'
weights. However, even if this term were zero, the variance of the hierarchical prior estimate and
the variance of the estimated weights still contribute. This means that if there is high variance
in these estimates or between the estimated client-individual weights, the influence of the prior will
be reduced.
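A minimal R sketch of this update, assuming beta (C x D matrix of weight means), Delta (length-D mean of q(∆)), cov_beta (list of D x D covariance matrices), and var_delta (length-D vector of the variances of q(∆)), so that tr(λ_∆) corresponds to sum(var_delta).

update_s <- function(beta, Delta, cov_beta, var_delta, c0, d0) {
  C <- nrow(beta); D <- ncol(beta)
  sq_diff <- sum(sweep(beta, 2, Delta)^2)     # sum_i sum_d (beta_{i,d} - Delta_d)^2
  c_n <- c0 + C * D / 2
  d_n <- d0 + 0.5 * (sq_diff + sum(var_delta) +
                     sum(sapply(cov_beta, function(S) sum(diag(S)))))
  list(c_n = c_n, d_n = d_n)                  # E[s] = c_n / d_n
}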

2.6 Update factor q(w)


Finally, we derive the update equation for w. For simplicity, we show the derivation for one of the
factors.

    ln q*_{w_d}(w_d) = E_∆[ ln p(∆_d | 0, w_d) + ln p(w_d | e, f) ] + const
      = E_∆[ (1/2) ln w_d − (w_d/2) ∆_d^2 ] + (e − 1) ln w_d − f w_d + const
      = E_∆[ −(w_d/2) ∆_d^2 ] + ( (e − 1) + 1/2 ) ln w_d − f w_d + const
      = −( (1/2) E_∆[∆_d^2] + f ) w_d + ( (e − 1) + 1/2 ) ln w_d + const

The result shows that this, too, is a log Gamma distribution, where we have to substitute the
expected values by the moments of the variational distribution. To keep the notation simple, we
state the update for each dimension d; collecting these values gives the vector f_n.

    e_n = e + 1/2,
    f_{n,d} = f + (1/2) ( ∆_d^2 + λ_{∆,dd} ).

8
Inspecting the update equation for w, we notice how it penalizes the weights for each dimension.
The first term is a quadratic penalty for being away from zero, and the second term adds a penalty
that describes the uncertainty in that weight. If a particular weight varies among the clients, it is
penalized more strongly than when it shows less variation among the clients.
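A minimal R sketch of this update, assuming Delta (length-D mean of q(∆)) and var_delta (length-D vector of its variances); the update is computed elementwise, so f_n is a vector with one rate per dimension.

update_w <- function(Delta, var_delta, e0, f0) {
  e_n <- e0 + 1 / 2
  f_n <- f0 + 0.5 * (Delta^2 + var_delta)   # one rate per dimension
  list(e_n = e_n, f_n = f_n)                # E[w_d] = e_n / f_n[d]
}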

2.7 Variational lower bound


Having derived all update equations, we know that iteratively updating the variational
factors will maximize the variational lower bound. In order to determine when the algorithm
has converged, we evaluate the value of the variational lower bound. This allows us to keep track of
the optimization process and provides an approximation of the evidence ln p(x). Therefore, we
plug the model's joint distribution and the variational distribution into the negative KL divergence
from Equation 1. This time, however, we use the continuous definition, and because the evidence
lower bound was defined as the negative divergence, the factors in the log fraction are switched.
The variational lower bound L(q) is then given by:

 
    L(q) = ∫∫∫∫∫ q(β, ∆, σ, s, w) ln ( p(Y, β, ∆, σ, s, w) / q(β, ∆, σ, s, w) ) dβ d∆ dσ ds dw
         = E_{β,∆,σ,s,w}[ ln p(Y, β, ∆, σ, s, w) ] − E_{β,∆,σ,s,w}[ ln q(β, ∆, σ, s, w) ]
         = E_{β,σ}[ ln p(Y | xβ, σ) ] + E_{β,∆,s}[ ln p(β | ∆, s) ] + E_{∆,w}[ ln p(∆ | 0, w) ] + E_σ[ ln p(σ) ]
           + E_s[ ln p(s) ] + E_w[ ln p(w) ] − E_β[ ln q(β) ] − E_∆[ ln q(∆) ] − E_σ[ ln q(σ) ] − E_s[ ln q(s) ] − E_w[ ln q(w) ]

The negative expectations of the variational distributions' logs, −E[ln q(·)], are the entropies
H(·) of the respective distributions. The various terms are given in the following:

    E_{β,σ}[ ln p(Y | xβ, σ) ] = (N/2)(ψ(a_n) − ln b_n) − (σ/2) ( Σ_{i=1}^{C} (x_i β_i − y_i)^T (x_i β_i − y_i) + Σ_{i=1}^{C} x_i^T λ_{β_i} x_i ),
    E_{β,∆,s}[ ln p(β | ∆, s) ] = (C/2)(ψ(c_n) − ln d_n) − (s/2) ( Σ_{i=1}^{C} (β_i − ∆)^T (β_i − ∆) + tr(λ_∆) + Σ_{i=1}^{C} tr(λ_{β_i}) ),
    E_{∆,w}[ ln p(∆ | 0, w) ] = Σ_{d=1}^{D} ( (1/2)(ψ(e_n) − ln f_{n,d}) − (w_d/2)(∆_d^2 + λ_{∆,dd}) ),
    E_σ[ ln p(σ) ] = (a_0 − 1)(ψ(a_n) − ln b_n) − b_0 σ,
    E_s[ ln p(s) ] = (c_0 − 1)(ψ(c_n) − ln d_n) − d_0 s,
    E_w[ ln p(w) ] = Σ_{d=1}^{D} ( (e_0 − 1)(ψ(e_n) − ln f_{n,d}) − f_0 w_d ),
    E_β[ ln q(β) ] = (1/2) Σ_{i=1}^{C} ln det(λ_{β_i}),
    E_∆[ ln q(∆) ] = (1/2) ln det(λ_∆),
    E_σ[ ln q(σ) ] = a_n − ln b_n + ln Γ(a_n) + (1 − a_n) ψ(a_n),
    E_s[ ln q(s) ] = c_n − ln d_n + ln Γ(c_n) + (1 − c_n) ψ(c_n),
    E_w[ ln q(w) ] = Σ_{d=1}^{D} ( e_n − ln f_{n,d} + ln Γ(e_n) + (1 − e_n) ψ(e_n) ),

9
where ψ(·) is the digamma function, tr(·) denotes the trace, and σ, s, and w_d denote the expected
values under their respective variational factors. During the optimization, the bound is maximized,
and the optimization is stopped when it reaches a plateau, |L(q_n) − L(q_{n+1})| < ε.
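Putting the pieces together, the following R sketch shows one possible coordinate ascent loop built from the update functions sketched in the previous subsections. The initialization values and the 'elbo' function (an implementation of L(q), not shown here) are placeholder assumptions.

fit_hblr <- function(X, Y, D, a0 = 1e-2, b0 = 1e-4, c0 = 1e-2, d0 = 1e-4,
                     e0 = 1e-2, f0 = 1e-4, tol = 1e-6, max_iter = 200) {
  C <- length(X)
  st <- list(beta = matrix(0, C, D),
             cov_beta = replicate(C, diag(D), simplify = FALSE),
             Delta = rep(0, D), var_delta = rep(1, D),
             a_n = a0, b_n = b0, c_n = c0, d_n = d0, e_n = e0, f_n = rep(f0, D))
  bound_old <- -Inf
  for (iter in seq_len(max_iter)) {
    for (i in 1:C) {                            # update each q(beta_i)
      upd <- update_beta(X[[i]], Y[[i]], st$Delta, st$a_n, st$b_n, st$c_n, st$d_n)
      st$beta[i, ] <- upd$mean
      st$cov_beta[[i]] <- upd$cov
    }
    upd <- update_delta(st$beta, rep(0, D), st$c_n, st$d_n, st$e_n, st$f_n)
    st$Delta <- upd$mean; st$var_delta <- upd$var
    st[c("a_n", "b_n")] <- update_sigma(X, Y, st$beta, st$cov_beta, a0, b0)
    st[c("c_n", "d_n")] <- update_s(st$beta, st$Delta, st$cov_beta, st$var_delta, c0, d0)
    st[c("e_n", "f_n")] <- update_w(st$Delta, st$var_delta, e0, f0)
    bound <- elbo(st, X, Y)                     # assumed implementation of L(q)
    if (abs(bound - bound_old) < tol) break     # stop at a plateau
    bound_old <- bound
  }
  st
}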

2.8 Predictive density


After we have obtained the approximate posterior distribution, we might want to make predictions
for new target variables y_* based on new observations x_*. For this, we require the posterior
predictive distribution, which also allows us to estimate confidence intervals. The posterior predictive
distribution for the new target variable of a newly observed sample is calculated by marginalizing
the distribution of the target variable given the observation and parameters over the posterior
distribution of the parameters, where we substitute the variational posterior for the exact posterior
distribution.

    p(y_* | x_*, D) = ∫∫ p(y_* | x_*, β, σ) p(β, σ | D) dβ dσ
      ≈ ∫∫ p(y_* | x_*, β, σ) q(β, σ) dβ dσ
      = ∫∫ N(y_* | x_* β, σ^{-1}) N(β | β_i, λ_{β_i}) Gamma(σ | a_n, b_n) dβ dσ
      = ∫ N(y_* | x_* β_i, σ^{-1} + x_*^T λ_{β_i} x_*) Gamma(σ | a_n, b_n) dσ
      = St( y_* | x_* β_i, a_n/b_n + x_*^T λ_{β_i} x_*, 2 a_n )

To obtain this result, we used standard results for the convolution of normal and Gamma distributions
[2, 10]. First, we convolved the two normal distributions to integrate out β. Marginalizing
the resulting normal distribution over the Gamma distribution results in a Student's t distribution
with mean x_* β_i, precision a_n/b_n + x_*^T λ_{β_i} x_*, and 2a_n degrees of freedom. The result shows that
the predictive uncertainty is the sum of the noise σ^{-1} and the variance of the client-individual weights
β_i. The number of degrees of freedom is approximately the number of observed samples; after
about 30 observed samples, the Student's t distribution can be approximated by a normal distribution.
Interestingly, these samples do not have to come from a single individual, since it is the total number
of observed samples that counts. We can also make predictions for clients that we have never observed
before by using the hierarchical prior ∆ and the variance s.
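A minimal R sketch of a point prediction with an interval for a known client i, under the assumption that λ_{β_i} is the posterior covariance returned by the update sketched earlier and that the predictive variance is taken as the expected noise variance b_n/a_n plus x_*^T λ_{β_i} x_*; this is one reading of the formula above, not a verbatim transcription of the paper's implementation.

predict_hblr <- function(x_star, beta_i, cov_beta_i, a_n, b_n, level = 0.95) {
  mean_star <- sum(x_star * beta_i)
  var_star  <- b_n / a_n + as.numeric(t(x_star) %*% cov_beta_i %*% x_star)
  df   <- 2 * a_n                                       # degrees of freedom of the Student's t
  half <- qt(1 - (1 - level) / 2, df) * sqrt(var_star)  # half-width of the interval
  c(mean = mean_star, lower = mean_star - half, upper = mean_star + half)
}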

3 Data Analysis
For a demonstration of the algorithm, we chose the freely available Turkiye Student Evaluation
Data Set [5] from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Turkiye+Student+Evaluation/).
The data set consists of evaluation scores of courses at Gazi University in Ankara (Turkey). To
rate the individual courses, 28 course-specific questions and 5 additional attributes were collected.
An overview of the questions and items is shown in Table 1.

Item Description
instr Instructor’s identifier
class Course code (descriptor)
repeat Number of times the student is taking this course
attendance Code of the level of attendance
difficulty Level of difficulty of the course as perceived by the student; values taken from {1,2,3,4,5}
Q1 The semester course content, teaching method and evaluation system were provided at the start.
Q2 The course aims and objectives were clearly stated at the beginning of the period.
Q3 The course was worth the amount of credit assigned to it.
Q4 The course was taught according to the syllabus announced on the first day of class.
Q5 The class discussions, homework assignments, applications and studies were satisfactory.
Q6 The textbook and other courses resources were sufficient and up to date.
Q7 The course allowed field work, applications, laboratory, discussion and other studies.
Q8 The quizzes, assignments, projects and exams contributed to helping the learning.
Q9 I greatly enjoyed the class and was eager to actively participate during the lectures.
Q10 My initial expectations about the course were met at the end of the period or year.
Q11 The course was relevant and beneficial to my professional development.
Q12 The course helped me look at life and the world with a new perspective.
Q13 The Instructor’s knowledge was relevant and up to date.
Q14 The Instructor came prepared for classes.
Q15 The Instructor taught in accordance with the announced lesson plan.
Q16 The Instructor was committed to the course and was understandable.
Q17 The Instructor arrived on time for classes.
Q18 The Instructor has a smooth and easy to follow delivery/speech.
Q19 The Instructor made effective use of class hours.
Q20 The Instructor explained the course and was eager to be helpful to students.
Q21 The Instructor demonstrated a positive approach to students.
Q22 The Instructor was open and respectful of the views of students about the course.
Q23 The Instructor encouraged participation in the course.
Q24 The Instructor gave relevant homework assignments/projects, and helped/guided students.
Q25 The Instructor responded to questions about the course inside and outside of the course.
Q26 The Instructor’s evaluation system (midterm, assignments, etc.) effectively measured the course objectives.
Q27 The Instructor provided solutions to exams and discussed them with students.
Q28 The Instructor treated all students in a right and objective manner.
Q1-Q28 are all Likert-type, meaning that the values are taken from 1,2,3,4,5

Table 1: Items and questions in the data set

The data have a hierarchical structure in the sense that each class is measured multiple times.
We can also see that each instructor teaches multiple classes, but modeling this structure would
require extending the model; therefore, we neglect this variable for now. The questions of
interest are: what are the characteristics of a difficult class, and can we predict the difficulty given
the questions and items? Since the data in the data set are ordinal, they are clearly not perfectly suited
for analysis with the developed algorithm, but they should suffice for demonstrating it. We
start the analysis with a linear regression model, and the results are shown in Table 2.
Item Estimate Std. Error t value Pr(>|t|) Item Estimate Std. Error t value Pr(>|t|)
nb.repeat 0.858 0.024 35.32 <0.000 Q14 0.082 0.048 1.72 0.086
attendance 0.453 0.011 39.58 <0.000 Q15 0.020 0.044 0.47 0.642
Q1 0.065 0.027 2.36 0.018 Q16 -0.151 0.040 -3.74 <0.001
Q2 0.035 0.035 0.99 0.320 Q17 0.201 0.035 5.81 <0.000
Q3 -0.020 0.031 -0.65 0.517 Q18 -0.074 0.039 -1.89 0.059
Q4 0.026 0.033 0.80 0.424 Q19 -0.012 0.040 -0.30 0.763
Q5 0.065 0.037 1.73 0.083 Q20 0.001 0.042 0.02 0.987
Q6 -0.019 0.034 -0.56 0.575 Q21 0.024 0.046 0.52 0.605
Q7 0.020 0.038 0.52 0.603 Q22 0.048 0.046 1.03 0.302
Q8 0.064 0.036 1.80 0.073 Q23 -0.033 0.043 -0.75 0.453
Q9 -0.037 0.030 -1.23 0.219 Q24 0.053 0.039 1.34 0.181
Q10 -0.091 0.041 -2.21 0.027 Q25 0.087 0.042 2.11 0.035
Q11 -0.028 0.031 -0.91 0.363 Q26 -0.084 0.036 -2.32 0.020
Q12 -0.026 0.030 -0.85 0.396 Q27 0.032 0.032 1.02 0.308
Q13 0.013 0.042 0.30 0.767 Q28 0.003 0.036 0.09 0.927

Table 2: Result of the linear regression to explain the difficulty of a class

The regression results show that both items, the number of times the student has taken the class
(nb.repeat) and the attendance, indicate a class with higher difficulty. Apparently, students are more
inclined to attend the lectures when the content is difficult. Regarding the questions, the 1st, 10th, 16th,
17th, 25th, and 26th show a significant relationship with the course difficulty. The first question
describes whether the course topic and content were presented adequately at the start of the course;
potentially, instructors of difficult classes tend to state the goals and evaluation method more clearly
in the beginning. The 10th item, which states whether the expectations about the course were met,
is negatively related to the difficulty. Supposedly, students are more pleased
with the outcome of a class if its perceived difficulty is lower. The 16th question, which asks whether
the instructor was committed to the course and easily understandable, is negatively associated with
the difficulty: enthusiastic and clearly understandable instructors are associated with easier
overall classes. Interestingly, the 17th question, which asks whether the instructor arrived on time, is
positively associated with the class difficulty. This suggests that if the instructor arrives on time
to class, its content might be more difficult. Similarly, the 25th question indicates that if
the instructor responded to questions inside and outside the class, the course was more difficult.
Question 26 indicates that the evaluation methods of courses with higher difficulty were perceived
as effective in measuring the course objectives.
This provides some overview of the data. We could fit the variational model to the data to
estimate the weights and confidence intervals. However, keep in mind that variational inference
only provides an approximation and will underestimate the variance of the posterior distribution.
To visualize this, we train the model on the data and additionally use a Gibbs sampler to estimate
the same model. The weight estimates and 95% confidence intervals are visualized in Figure 2.

Figure 2: Comparison of the estimated weights and confidence interval of the variational approx-
imation and a Gibbs sampler

The variational model also provides a ranking of the importance of the features, which we
compare to the feature ranking of the Lasso regression [12]. For the estimation of the Lasso ranking,
we add the class variable and rank the features according to their first
appearance in the lambda path. The hierarchical model, however, does not have a variance
parameter w_i for each class; therefore, we use the estimated weights for their ranking. Because
we used two different estimates for ranking the features of the variational model, we present
the results separately in Table 3.
The importance ranking of the questions is similar in both models. However, the ranking for
the classes, where we use the absolute value of the weights for the variational model, appears to
differ. Although both models agree on the importance of class 7, the Lasso regression ranks class
13 as second most important, whereas the variational model ranks it last. Utilizing the individual
classes' weights might not be a reliable method to rank their importance.
As a next objective, we are interested in the prediction performance of the variational model
in comparison to similar models. Therefore, we use 10-fold cross-validation, where we train the model
on 9 folds and predict the remaining fold. Before we estimate the mean square error of that fold, we
round the predictions; a minimal sketch of this procedure is shown below. For the estimation of the
cross-validated error, we use linear regression, Lasso regression, ridge regression, the variational
hierarchical Bayesian regression, and a linear mixed-effects model with an individual intercept for each class.
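The following R sketch illustrates the cross-validation setup described above; 'dat' (a data frame containing the response 'difficulty'), 'fit_model', and the matching 'predict' method are placeholders for the respective models and are not part of the published implementation.

cv_error <- function(dat, fit_model, k = 10, seed = 1) {
  set.seed(seed)
  folds <- sample(rep(1:k, length.out = nrow(dat)))    # random fold assignment
  errs <- sapply(1:k, function(f) {
    fit  <- fit_model(dat[folds != f, ])               # train on the other 9 folds
    pred <- round(predict(fit, dat[folds == f, ]))     # predictions are rounded
    mean((pred - dat$difficulty[folds == f])^2)        # mean square error of the fold
  })
  mean(errs)
}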
The results of the cross-validated error in Table 4 show that the Lasso regression has the lowest
prediction error, followed by the linear regression. The error of the hierarchical regression is higher,
but still lower than the prediction error of the ridge regression, and it is comparable
to the error of the linear mixed-effects model. Apparently, the setup of the analysis and the data do
not favor the hierarchical models; however, we hope that this provides some intuition into the
model's use, capabilities, and limitations.

Comparison of the ranking for the questions:

     Lasso regression   Variational model
1    attendance         attendance
2    nb.repeat          nb.repeat
3    Q17                Q17
4    Q9                 Q16
5    Q11                Q26
6    Q12                Q9
7    Q10                Q22
8    Q16                Q1
9    Q1                 Q25
10   Q22                Q18
11   Q25                Q5
12   Q26                Q2
13   Q18                Q10
14   Q4                 Q11
15   Q2                 Q3
16   Q5                 Q14
17   Q14                Q4
18   Q3                 Q21
19   Q21                Q12
20   Q24                Q27
21   Q7                 Q7
22   Q27                Q20
23   Q13                Q24
24   Q19                Q28
25   Q23                Q13
26   Q28                Q6
27   Q20                Q19
28   Q6                 Q23
29   Q8                 Q15
30   Q15                Q8

Comparison of the ranking for the classes:

     Lasso regression   Variational model
1    class 7            class 7
2    class 13           class 8
3    class 2            class 2
4    class 1            class 11
5    class 6            class 4
6    class 11           class 6
7    class 10           class 3
8    class 5            class 12
9    class 3            class 9
10   class 9            class 5
11   class 4            class 10
12   class 12           class 1
13   class 8            class 13

Table 3: Comparison of estimated feature importance for Lasso regression and the variational
model

Model              Regression   Lasso   Variational model   LME model   Ridge
Mean square error  1.45         1.43    1.45                1.45        1.54

Table 4: Cross-validated mean square error of the various models

4 Concluding Remarks
We derived a variational approximation for a hierarchical Bayesian regression model and demonstrated
its application on a real-world data set. The main benefit of this model might be that it
can be estimated considerably faster than the same model using Gibbs sampling. Therefore, we
see its use in applications where the model has to be estimated quickly, repeatedly, or on a
bigger data set. Furthermore, we hope that this document sheds some light on and raises interest
in Bayesian modeling, variational approximation, and their potential use in applications and data
analysis.

5 Acknowledgements
We want to thank the UCI Machine Learning Repository and the publishers of the data, who
made this analysis possible.

References
[1] Matthew J. Beal. Variational algorithms for approximate Bayesian inference. PhD thesis,
2003.

[2] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and
Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.
[3] Jan Drugowitsch. Variational Bayesian inference for linear and logistic regression. 2013.
[4] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restora-
tion of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741, 1984.
[5] N. Gunduz and E. Fokoue. UCI Machine Learning Repository, 2013.
[6] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications.
Biometrika, 57(1):97–109, 1970.

[7] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An intro-
duction to variational methods for graphical models. Mach. Learn., 37(2):183–233, November
1999.
[8] S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Statist., 22(1):79–
86, 1951.

[9] David J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.


[10] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 1st edition,
2012.
[11] Christian P. Robert and George Casella. Monte Carlo Statistical Methods (Springer Texts in
Statistics). Springer-Verlag, Berlin, Heidelberg, 2005.
[12] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society, Series B, 58(1):267–288, 1996.
[13] Michael E. Tipping. The Relevance Vector Machine. In Advances in Neural Information
Processing Systems (NIPS’ 2000), number 1, pages 652–658, 2000.

[14] David Wipf and Srikantan Nagarajan. New view of automatic relevance determination. In
Advances in Neural Information Processing Systems 20, pages 1625–1632.

