BASAD Research Presentation
Department of Statistics
University of Michigan, Ann Arbor
Introduction
Sparsity: #{j : βj ≠ 0} ≪ (p ∧ n)
Penalization methods
Bayesian Framework
Place priors on βj as

βj | σ², Zj = 0 ∼ N(0, σ² τ0n²),   βj | σ², Zj = 1 ∼ N(0, σ² τ1n²),

and on Z = (Z1 , . . . , Zp ) as independent Bernoulli(qn) indicators
Goal:
To study model selection properties using such priors, and
to provide a practically feasible implementation even when p ≫ n
Two Challenges:
Theoretical Challenge: analyzing the posterior over a huge space
of 2^p models
– A new framework using sample-size dependent priors
Computational Challenge: standard Gibbs samplers not scalable
due to large matrix computations
– A scalable and flexible Gibbs sampling algorithm
Outline
3 Scalable Computation
Not consistent!
Consider τ0n² → 0, τ1² < ∞ and P[Zj = 1] = 0.5. Then,

P[Zj = 1 | Y] > 0.5  ⇐⇒  β̂j ≥ tn ≈ √(log n / n),
Spike prior shrinks: τ0,n² → 0 (faster than 1/n)
– Does not miss “small” non-zero coefficients
Slab prior diffuses: τ1,n² → ∞ (as pn^(2+δ)/n ∨ 1)
– Acts as a penalty that drives inactive covariates to Z = 0
Y | β ∼ N(Xβ, σ² I)
P(Zj = 1) = qn , j = 1, . . . , pn
σ² ∼ IG(α1 , α2 )
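The hierarchy above can be sketched as a single generative draw. The hyperparameter values below (qn, tau0sq, tau1sq, alpha1, alpha2) are illustrative choices, not the paper's defaults:

```python
import numpy as np

# Illustrative draw from the spike-and-slab hierarchy above.
# All hyperparameter values here are assumed for demonstration.
rng = np.random.default_rng(0)
n, p = 100, 250
qn, tau0sq, tau1sq = 0.05, 1.0 / n, 10.0
alpha1, alpha2 = 2.0, 1.0

sigma2 = 1.0 / rng.gamma(alpha1, 1.0 / alpha2)      # sigma^2 ~ IG(alpha1, alpha2)
Z = rng.random(p) < qn                              # Z_j ~ Bernoulli(qn)
tausq = np.where(Z, tau1sq, tau0sq)                 # slab variance if Z_j = 1, spike if 0
beta = rng.normal(0.0, np.sqrt(sigma2 * tausq))     # beta_j | Z_j, sigma^2
X = rng.standard_normal((n, p))
Y = rng.normal(X @ beta, np.sqrt(sigma2))           # Y | beta ~ N(X beta, sigma^2 I)
```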
Notation
Posterior Ratios
Lemma
For any model k ≠ t, we have

PR(k, t) ≤ (n τ1n² qn⁻²)^(−(|k|−|t|)/2) exp{ −(R̃k − R̃t) / (2σ²) }
         “≈” ( √(pn^(2+δ) ∨ n) · pn )^(−(|k|−|t|)) exp{ −(R̃k − R̃t) / (2σ²) },
This implies

PR(k, t) = P[Z = k | Data] / P[Z = t | Data] → 0 in probability, for each k ≠ t.   (2)
Remark 1: pn can be large such that |t| log pn / n → 0,
and has the same rate as that of EBIC (Chen and Chen, 2008)
More recently,
Castillo, Schmidt-Hieber, and van der Vaart (2015) – SSC
with n-dependent Laplace priors for pn > n
Skinny Gibbs Sampler
Scalable Computation
Posterior Computation
V = (X′X + Dz)⁻¹,

where Dz = Diag( Z τ1,n⁻² + (1 − Z) τ0,n⁻² )

Strategy:
– To sparsify the precision matrix X′X + Dz so that the βj’s can somehow be sampled independently
S⁻¹ := ( XA′XA + τ1n⁻² I          0
               0           (n + τ0n⁻²) I )

We can recover!
Skinny Gibbs

Rj = [ qn φ(βj; 0, σ² τ1,n²) / ((1 − qn) φ(βj; 0, σ² τ0,n²)) ] · exp{ σ⁻² βj Xj′(Y − XAj βAj) },

S = ( (XA′XA + τ1n⁻² I)⁻¹            0
               0             (n + τ0n⁻²)⁻¹ I )

P[Zj = 1 | Z−j , Rest] = Rj / (1 + Rj)
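A minimal sketch of one Skinny Gibbs sweep implied by these conditionals, with σ² held fixed for simplicity (the σ² update is omitted, and the demo hyperparameters are illustrative): the active block βA is drawn jointly, inactive coordinates are drawn independently using the sparsified precision, and each Zj is flipped with probability Rj / (1 + Rj), computed on the log scale for stability.

```python
import numpy as np

def skinny_gibbs_sweep(Y, X, beta, Z, sigma2, tau0sq, tau1sq, qn, rng):
    """One sweep of Skinny Gibbs (sketch, sigma^2 held fixed):
    joint draw for the active block, independent draws for the
    inactive coordinates, then a Z_j update using log R_j."""
    n, p = X.shape
    A = np.flatnonzero(Z)                 # active set
    I = np.flatnonzero(~Z)                # inactive set

    # Active block: beta_A | rest ~ N(V X_A' Y, sigma^2 V),
    # with V = (X_A' X_A + tau1^{-2} I)^{-1}.
    if A.size:
        XA = X[:, A]
        V = np.linalg.inv(XA.T @ XA + np.eye(A.size) / tau1sq)
        beta[A] = rng.multivariate_normal(V @ XA.T @ Y, sigma2 * V)

    # Inactive coordinates: independent N(0, sigma^2 / (n + tau0^{-2})),
    # from the sparsified (block-diagonal) precision S^{-1}.
    beta[I] = rng.normal(0.0, np.sqrt(sigma2 / (n + 1.0 / tau0sq)), size=I.size)

    # Z_j | rest ~ Bernoulli(R_j / (1 + R_j)), via log R_j.
    for j in range(p):
        others = np.flatnonzero(Z)
        others = others[others != j]
        resid = Y - X[:, others] @ beta[others]
        log_rj = (np.log(qn / (1.0 - qn))
                  + 0.5 * np.log(tau0sq / tau1sq)
                  - 0.5 * beta[j] ** 2 / sigma2 * (1.0 / tau1sq - 1.0 / tau0sq)
                  + beta[j] * (X[:, j] @ resid) / sigma2)
        Z[j] = rng.random() < 1.0 / (1.0 + np.exp(-np.clip(log_rj, -30, 30)))
    return beta, Z
```

Only the |A| × |A| active block ever requires a matrix inverse, which is the point of the sparsification.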
Example:
– Logistic Distribution

Li ∼ Logistic(xi β) ⇐⇒ Li | si ∼ N(xi β, si²); si / 2 ∼ FKS,

where FKS is the Kolmogorov-Smirnov distribution (Stefanski, 1991)
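To see why this augmentation helps: conditional on the latent scales si, the model is a heteroscedastic Gaussian model, so normal-prior updates go through with weights 1/si². A sketch with assumed fixed scales (in the actual sampler, si / 2 would be drawn from FKS; the scale and prior values here are illustrative):

```python
import numpy as np

# Given latent scales s_i, L_i | s_i ~ N(x_i beta, s_i^2) is Gaussian,
# so beta | s has a closed-form conditional under a N(0, tau^2 I) prior.
# The scales s and prior variance tausq are assumed values for illustration.
rng = np.random.default_rng(2)
n, p = 200, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5])
L = X @ beta_true + rng.logistic(size=n)    # latent logistic responses

s = np.full(n, np.pi / np.sqrt(3.0))        # assumed scales (the logistic sd)
tausq = 100.0
W = np.diag(1.0 / s**2)                     # precision weights from the scales
V = np.linalg.inv(X.T @ W @ X + np.eye(p) / tausq)
beta_draw = rng.multivariate_normal(V @ X.T @ W @ L, V)
```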
Simulation Study
n = 100, p = 250.
Simulation Results

Columns: TP (True Positives), FP (False Positives), Z = t (Exact Selection), Z4 = t (Top-4 Selection)
Figure: Cross Validated Prediction Error versus Model Size for several
model selection methods
Censoring

Yᵒ = (Y ∨ c)
Non-convexity

Powell’s objective function P(β) = Σ_{i=1}^n wi ρτ( Yi − (Xi β ∨ c) )

Non-convexity and misspecification cause theoretical difficulties!
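A direct transcription of this objective (function names and the equal-weights default are illustrative); the ∨ c truncation creates regions where the fit is flat in β, which is where the non-convexity comes from:

```python
import numpy as np

def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0.0))

def powell_objective(beta, Y, X, c, tau, w=None):
    """Powell's censored-quantile objective (sketch): fitted values
    are censored at c before the check loss; weights default to 1."""
    w = np.ones_like(Y) if w is None else w
    fitted = np.maximum(X @ beta, c)        # (X_i beta) v c
    return float(np.sum(w * check_loss(Y - fitted, tau)))
```

For any β with Xi β < c for all i, the fitted values are all c, so the objective is constant on that region; these flat pieces rule out convexity.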
Summing up
Conclusions
Thank You!
and
Bhramar Mukherjee
Biostatistics, U Mich.
Juan Shen
Management, Fudan U
Minsuk Shin
Statistics, Texas A&M U
Minimum non-zero eigenvalues of the submatrices of Xn′Xn
with size Mn = n / log pn may go to zero, but not too fast!
Simulation Setting and Default Parameters
Covariate Distribution:

xi ∼ N(0, Σ) independently; ΣAA = ρ1 1_{4×4}, ΣAI = ρ2 1_{4×(p−4)}, ΣII = ρ3 1.
ρ1 = ρ2 = ρ3 = 0
ρ1 = 0.1; ρ2 = 0.25; ρ3 = 0.5
τ0n² = 1/n,   τ1n² = max( pn^2.1 / (100n), 1 ),
Kolmogorov-Smirnov Distribution

It is the distribution of

K = sup_{t ∈ [0,1]} |B(t)|,

where B(t) is a Brownian bridge.
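A quick Monte Carlo check of this characterization, approximating the bridge on a grid (grid size and replication count are arbitrary choices): the mean of K should land near √(π/2)·ln 2 ≈ 0.8687, slightly below it because the grid sup understates the true sup.

```python
import numpy as np

# Simulate K = sup_{t in [0,1]} |B(t)| on a grid, where B is a
# Brownian bridge built from a random walk.
rng = np.random.default_rng(3)
m, reps = 1000, 2000

dW = rng.standard_normal((reps, m)) / np.sqrt(m)    # Brownian increments
W = np.cumsum(dW, axis=1)                           # W(t) on the grid
t = np.arange(1, m + 1) / m
B = W - t * W[:, -1:]                               # bridge: B(t) = W(t) - t W(1)
K = np.abs(B).max(axis=1)                           # sup |B(t)| per path

print(K.mean())   # close to sqrt(pi/2) * ln 2, about 0.8687
```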
Computational Time
Figure: CPU time (in seconds) for BASAD and Skinny Gibbs for n = 100
and varying p.
Computational Complexity
Posterior Probabilities - Lymph Data Example
Non-convexity - Theoretical Details
Powell’s objective function: Pow(β) = Σ_{i=1}^n wi ρτ( Yi − (Xi β ∨ c) ).

Pow(β) is non-convex, with no global quadratic approximations. But

Local quadratic approximation: for ‖β − β0‖ ≤ εn = |t|² log pn / n → 0,

Pow(β) − Pow(β0) = n (β − β0)′ Dw (β − β0) + oP(1)