
Journal of Computational and Applied Mathematics 273 (2015) 251–263

Parallel maximum likelihood estimator for multiple linear regression models

Guangbao Guo a,b,∗, Wenjie You c, Guoqi Qian d, Wei Shao e

a Department of Statistics, Shandong University of Technology, Zibo 255000, China
b School of Mathematics, Shandong University, Jinan 250100, China
c Department of Automation, Xiamen University, Xiamen 361005, China
d Department of Mathematics and Statistics, The University of Melbourne, Parkville VIC 3010, Australia
e School of Management, Qufu Normal University, Rizhao 276800, China

Article history:
Received 23 August 2012
Received in revised form 29 September 2013

MSC:
62J05
65F15
65G50

Keywords:
Multiple linear regression models
Parallel computing
Maximum likelihood estimator
Consistency
Outlier

Abstract

Consistency and run-time are important questions when fitting multiple linear regression models. In response, we introduce a new parallel maximum likelihood estimator for multiple linear models. We first provide an equivalence condition between the method and the generalized least squares estimator. We also consider the rank of the projections and the eigenvalue bounds. We then establish consistency when a stable solution exists. In this paper, we describe several consistency theorems and perform experiments on consistency, outliers, and scalability. Finally, we fit the proposed method to bankruptcy data.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Multiple linear regression models (MLRMs) are widely used in many statistical problems. Several parallel methods and criteria for choosing subsets have been proposed in recent years (see [1]) to improve run-time and computational efficiency. We provide a brief overview of parallel methods that are useful for solving MLRMs.
Mitchell and Beauchamp [2] created a parallel method for the subset selection problem using a Bayesian perspective.
Havránek and Stratkoš [3] considered parallel methods for the Cholesky factorization in multiple linear models, and showed that the performance of these methods is independent of the size of the data sets.
Xu et al. [4] suggested a form of stochastic domain decomposition in multiple linear models to improve performance and to resist processor failure. The general idea of domain decomposition is to split the data so that the processors receive data sets of nearly equal size and computation time. The importance of data size during parallel communication was similarly considered.
Skvoretz et al. [5] experimented with MLRMs in social science research. The parallel component of their computation was the calculation of a covariance matrix using a single-program multiple-data scheme. Different numbers of processors were used in the experiments, and varying amounts of data were read from disk. They found that the latter consideration was critical to obtaining good performance.

∗ Corresponding author at: Department of Statistics, Shandong University of Technology, Zibo 255000, China.
E-mail address: [email protected] (G. Guo).
http://dx.doi.org/10.1016/j.cam.2014.06.005
Bouyouli et al. [6] developed global minimal and global orthogonal residual methods for MLRMs, all of which are block Krylov subspace methods suited to parallel computation.
MLRMs allow for a highly effective parallel implementation, which illustrates our point and encourages further development in both theory and application. This work originates from the statistical analysis of multiple linear models [7] in statistical tests and from several examples of the parallel maximum likelihood estimator (PMLE). We study properties of stochastic domain decomposition for the maximum likelihood estimator (MLE) in multiple linear models.
We provide a general MLE in multiple linear models. Suppose that MLRMs have the following form:

Y = Xβ + ε,  ε ∼ N(0, σ²I),   (1.1)

where X ∈ R^{n×p} is a known matrix of fixed rank, rank(X) = p, p ≪ n, Y ∈ R^{n×1} is an observable random vector, β ∈ R^{p×1} is a vector of unknown parameters, I ∈ R^{n×n} is a known identity matrix, and σ² is a positive unknown parameter.
The MLE is often used to estimate unknown parameters in multiple linear models. The MLE of β under the model (1.1) is defined as

β̂ = arg min_β (Y − Xβ)^T (Y − Xβ).   (1.2)

We then have

β̂ = (X^T X)^{−1} X^T Y.   (1.3)
The PMLE method in (1.1) is as follows: first, (X, Y) is sent to each of the r processors; second, different elements of (X, Y) are acquired by stochastic domain decomposition in each processor, denoted as (X_i, Y_i); third, the MLE is computed on each processor; finally, the PMLE is obtained by combining the estimators from all processors. The PMLE method is a domain decomposition method and has a short run-time on large data sets. Although a number of methods exist for MLRMs, the proposed method is faster and more robust in some cases.
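To make the procedure concrete, the following R sketch (our own simplified illustration, not the authors' code) simulates data from (1.1), draws r random subsamples, fits the MLE on each, and averages the r estimates to form the PMLE. A real deployment would distribute the per-subsample fits across MPI processes (e.g., with Rmpi/SNOW); the loop below is kept serial for clarity.

```r
set.seed(1)
n <- 1024; p <- 8; r <- 16; n0 <- 128            # sample size, parameters, processors, subsample size
beta <- runif(p, -1, 1)
X <- matrix(rnorm(n * p), n, p)
Y <- X %*% beta + rnorm(n)

# One subsample estimate: rows chosen by stochastic domain decomposition
pmle_piece <- function(idx) {
  Xi <- X[idx, , drop = FALSE]
  Yi <- Y[idx]
  solve(crossprod(Xi), crossprod(Xi, Yi))        # (Xi' Xi)^{-1} Xi' Yi
}

subsamples <- replicate(r, sample(n, n0), simplify = FALSE)
beta_hats  <- sapply(subsamples, pmle_piece)     # p x r matrix of per-processor MLEs
beta_tilde <- rowMeans(beta_hats)                # PMLE: average of the r estimates

c(pmle_error = sqrt(sum((beta_tilde - beta)^2)),
  mle_error  = sqrt(sum((solve(crossprod(X), crossprod(X, Y)) - beta)^2)))
```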
The rest of this paper is organized as follows. In Section 2, we introduce the PMLE method (see [8]) and provide an equivalence condition between the PMLE and a generalized least squares estimator. In Section 3, we first consider the rank of the projections in the PMLE method, followed by the eigenvalue bounds. We study the consistency of the PMLE in Section 4. In Section 5, we illustrate the method through several experimental studies, including those on consistency, outliers, and scalability. Experiments with bankruptcy data are also provided. Section 6 discusses future research. The Appendix collects the technical results.

2. PMLE of MLRMs

In this section, we introduce the matrix form of the proposed PMLE in (1.1). We assume that X_i (i = 1, . . . , r) are the subsamples of the observed sample X, where X ∈ R^{n×p}. Write

X_i = R_i X,  E_i = R_i^T R_i,
E_i = diag{α_1, α_2, . . . , α_n},  rank{E_i} = n_0 ≥ p,  i = 1, . . . , r.   (2.1)

Here, R_i is the projection operator, Σ_{i=1}^n α_i = n_0, α_i ∼ B(n_0, 1/n), and E(α_i) = n_0/n. Note that

I ≤ Σ_{i=1}^r E_i ≤ qI.   (2.2)

Here q is the number of matrices E_i with a nonzero entry in a given row.


Let Y_i = R_i Y such that Y_i = X_i β_i + ε_i, E(ε_i) = 0, and

E(X^T E_i X β_i | X) = (n_0/n) X^T X β_i;  E(X^T E_i Y | X) = (n_0/n) X^T Y,  i = 1, . . . , r.

We write

β̂_i = (X^T E_i X)^− X^T E_i Y,  i = 1, . . . , r.   (2.3)
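A minimal R sketch of this matrix form is given below (our own illustration; variable names are hypothetical). It builds one selection operator R_i, the corresponding diagonal matrix E_i = R_i^T R_i, and the per-processor estimator β̂_i of (2.3) using a Moore–Penrose generalized inverse, and checks that it agrees with the subsample MLE.

```r
library(MASS)                                   # for ginv(), a generalized inverse

set.seed(2)
n <- 200; p <- 5; n0 <- 40
X <- matrix(rnorm(n * p), n, p)
Y <- X %*% runif(p) + rnorm(n)

idx <- sort(sample(n, n0))                      # rows kept by processor i
Ri  <- diag(n)[idx, , drop = FALSE]             # n0 x n selection operator
Ei  <- t(Ri) %*% Ri                             # n x n diagonal 0/1 matrix, rank n0

Xi <- Ri %*% X                                  # equals X[idx, ]
Yi <- Ri %*% Y

beta_i <- ginv(t(X) %*% Ei %*% X) %*% t(X) %*% Ei %*% Y        # estimator (2.3)
all.equal(c(beta_i), c(ginv(t(Xi) %*% Xi) %*% t(Xi) %*% Yi))   # same as the subsample MLE
```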
Assume that

β̃ = (1/r) Σ_{i=1}^r β̂_i = (1/r) Σ_{i=1}^r (X^T E_i X)^− X^T E_i Y,   (2.4)

which is the PMLE in (1.1). The parallel estimator is thus a generalized least squares (GLS) method with domain decomposition. To aid understanding of the method, we give an illustration of the PMLE method in Fig. 1, followed by a remark.
We establish the conditions under which a covariance matrix Σ exists such that the PMLE is equivalent to the GLS estimator

β̂_G = (X^T Σ^{−1} X)^{−1} X^T Σ^{−1} Y,   (2.5)

where Σ ∈ R^{n×n} is a nonsingular matrix.

Fig. 1. Illustration of the PMLE method: r is the number of processors; n0 is the length of each subsample.

If l exists such that {E_{r_j}, j = 1, . . . , l} ⊂ {E_i, i = 1, . . . , r}, and

(1/r) Σ_{i=1}^r (X^T E_i X)^− X^T E_i = (X^T Σ_{j=1}^l E_{r_j} X)^{−1} X^T Σ_{j=1}^l E_{r_j},   (2.6)

where rank(Σ_{j=1}^l E_{r_j}) = n and l < r, we obtain the following equivalence proposition between the PMLE and the GLS estimator.

Proposition 2.1. When {Erj , j = 1, . . . , l} satisfies Eq. (2.6), the PMLE of MLRMs is a GLS estimator.

Remark. In Eq. (2.6), if {E_{r_j}, j = 1, . . . , l} satisfies

Σ_{j=1}^l E_{r_j} = I,

the PMLE of MLRMs is an MLE.
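As a quick numerical illustration of this remark (our own check, not taken from the paper), the R snippet below builds E_{r_j} matrices from a partition of the row indices, so that they sum to the identity, and verifies that the combined estimator on the right-hand side of (2.6) then coincides with the ordinary MLE (1.3).

```r
set.seed(3)
n <- 60; p <- 4; l <- 3
X <- matrix(rnorm(n * p), n, p)
Y <- X %*% runif(p) + rnorm(n)

groups <- split(sample(n), rep(1:l, length.out = n))   # partition of the row indices
E_list <- lapply(groups, function(g) diag(as.numeric(seq_len(n) %in% g)))

E_sum <- Reduce(`+`, E_list)
stopifnot(isTRUE(all.equal(E_sum, diag(n))))           # the E_rj sum to the identity

beta_comb <- solve(t(X) %*% E_sum %*% X, t(X) %*% E_sum %*% Y)
beta_mle  <- solve(t(X) %*% X, t(X) %*% Y)             # estimator (1.3)
all.equal(beta_comb, beta_mle)                         # TRUE
```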

3. Properties of the PMLE in MLRMs

In this section, we investigate properties of the PMLE: the rank of the projections {P_i = X(X^T E_i X)^− X^T E_i, i = 1, 2, . . . , r}, as well as the eigenvalue bounds of {X_i^T X_i ± X^T(I − E_i)X, i = 1, 2, . . . , r}.

3.1. Rank of projections for the PMLE in MLRMs

The following is a theorem on the rank of projections for the PMLE in multiple linear models.

Theorem 3.1. Assuming that the projections are {P_i = X(X^T E_i X)^− X^T E_i, i = 1, 2, . . . , r} in (2.2), we then have

max_i {rank(P_i)} = rank(X);
min_i {rank(P_i)} = min_i rank(E_i X);
min_i {I_p − rank(P_i)} = p − rank(X).
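A small numerical sanity check of these rank identities (our own illustration, not part of the paper) can be written in R using MASS::ginv for the generalized inverse and qr() for numerical ranks.

```r
library(MASS)

set.seed(4)
n <- 30; p <- 4; r <- 5; n0 <- 10
X <- matrix(rnorm(n * p), n, p)

E_list <- replicate(r, {
  idx <- sample(n, n0)
  diag(as.numeric(seq_len(n) %in% idx))          # a random diagonal 0/1 matrix of rank n0
}, simplify = FALSE)

P_rank <- sapply(E_list, function(Ei) {
  Pi <- X %*% ginv(t(X) %*% Ei %*% X) %*% t(X) %*% Ei
  qr(Pi)$rank
})
EX_rank <- sapply(E_list, function(Ei) qr(Ei %*% X)$rank)

max(P_rank) == qr(X)$rank        # max_i rank(P_i) = rank(X)
min(P_rank) == min(EX_rank)      # min_i rank(P_i) = min_i rank(E_i X)
```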

3.2. Eigenvalue of the PMLE in MLRMs

For a fixed X_i and a fixed E_i in (2.1), we seek the eigenvalue bounds of

{X_i^T X_i ± X^T(I − E_i)X}.

Let V = (I − E_i)X. Then V^T V = X^T(I − E_i)X and X_i^T X_i + V^T V = X^T X.
We write

X^T X = P diag(λ_1², . . . , λ_p²) P^T,

where P is an orthogonal matrix and λ_1² ≤ · · · ≤ λ_p², and

X_i^T X_i = P̃ diag(ν_1², . . . , ν_p²) P̃^T,

where P̃ is an orthogonal matrix and ν_1² ≤ · · · ≤ ν_p². Note that

X_i^T X_i + V^T V = X^T E_i X + X^T(I − E_i)X = X^T X.

Let P̃ = (p_1, P̃_1) with P̃_1 = (p_2, . . . , p_p). If ν_1² > 0, we have

ν_1² ≤ λ_1² ≤ ν_2²,   (3.1)
ν_1² − λ_max(V^T V) ≤ λ_min(X_i^T X_i − V^T V) ≤ ν_1².   (3.2)

We assign y = (y_1, . . . , y_p)^T with yy^T = V^T V and y_1 ≤ · · · ≤ y_p. Define the projections of the vector y onto the eigenvectors of X_i^T X_i as

y_{i:j} = (p_i, . . . , p_j)^T y,  1 ≤ i ≤ j ≤ p.

For a fixed E_i, the smallest and largest eigenvalues are bounded in terms of projections onto the two-dimensional subspaces below.

Theorem 3.2. If yy^T = V^T V, y = (y_1, . . . , y_p)^T and

L_± = \begin{pmatrix} ν_1² & 0 \\ 0 & ν_2² \end{pmatrix} ± \begin{pmatrix} y_1² & y_1‖y_{2:p}‖ \\ y_1‖y_{2:p}‖ & ‖y_{2:p}‖² \end{pmatrix},
U_± = \begin{pmatrix} ν_1² & 0 \\ 0 & ν_2² \end{pmatrix} ± \begin{pmatrix} y_1² & y_1 y_2 \\ y_1 y_2 & y_2² \end{pmatrix},

then λ_min(L_±) ≤ λ_min(X_i^T X_i ± V^T V) ≤ λ_min(U_±), where

ν_1² ≤ λ_min(L_+) ≤ λ_min(U_+),  ν_1² − µ_p ≤ λ_min(L_−) ≤ λ_min(U_−) ≤ ν_1².

We can improve Weyl's theorem for the largest eigenvalue of X^T X by

ν_p² ≤ λ_max(X^T X) ≤ ν_p² + ‖y‖²,
ν_2² ≤ λ_max(X_i^T X_i − yy^T) ≤ ν_p² + ‖y‖².
The following theorem is for a fixed Xi .

Theorem 3.3. Let yy^T = V^T V, y = (y_1, . . . , y_p)^T and

L_± = \begin{pmatrix} ν_p² & 0 \\ 0 & ν_{p−1}² \end{pmatrix} ± \begin{pmatrix} y_1² & y_1 y_2 \\ y_1 y_2 & y_2² \end{pmatrix},
U_± = \begin{pmatrix} ν_p² & 0 \\ 0 & ν_{p−1}² \end{pmatrix} ± \begin{pmatrix} y_1² & y_1‖y_{2:p}‖ \\ y_1‖y_{2:p}‖ & ‖y_{2:p}‖² \end{pmatrix}.

Then λ_max(L_±) ≤ λ_max(X_i^T X_i ± V^T V) ≤ λ_max(U_±), where

ν_p² ≤ λ_max(L_+) ≤ λ_max(U_+),  ν_{p−1}² ≤ λ_max(L_−) ≤ λ_max(U_−) ≤ ν_p².

4. Consistency of the PMLE in MLRMs

In this section, we present several theoretical studies on the consistency of the PMLE in MLRMs. We expect that as r increases, β̃ = (1/r) Σ_{i=1}^r β̂_i will approach the true value β. For any ε > 0, P(‖β̃ − β‖ ≥ ε) → 0 as n → ∞. By Chebyshev's inequality,

P(‖β̃ − β‖ ≥ ε) ≤ E‖β̃ − β‖ / ε,

indicating that E‖β̃ − β‖ → 0 is sufficient to verify consistency. We observe that

‖β̃ − β‖ = ‖(1/r) Σ_{i=1}^r β̂_i − β‖ ≤ (1/r) Σ_{i=1}^r ‖β̂_i − β‖.

Thus, we only need to study the error bound ‖β̃ − β‖.


Below is a brief definition of a stable solution and a related preparatory lemma.

Definition 4.1. β̂ is the stable solution if an error bound of the form

‖β̂ − β‖ ≤ ϵ · p · ‖β‖

is satisfied, where β is the true solution, β̂ is the computed solution, p is the rank of X, and ϵ > 0.

We initially require a preliminary lemma with fixed E_i, which is due to Stewart.

Lemma 4.1. Define two subsets C_0 and C_1 of R^{n×1} as

C_0 = {z : z = Xw, ‖z‖ = 1}  and  C_1 = {z : X^T E_i z = 0}.

Then C_0 ∩ C_1 = ∅.
A preliminary theorem on the projection bound is provided below, before the theorems on the error bound are discussed.

Theorem 4.2. Considering the PMLE of MLRMs in (1.1), we denote the following:
Qi = (X T Ei X )− X T Ei , Pi = X (X T Ei X )− X T Ei = XQi , i = 1, . . . , r .
Then constants CX and C̄X exist such that
∥Qi ∥ ≤ CX ; ∥Pi ∥ ≤ C̄X , i = 1, . . . , r .

The theorem on error bounds in MLRMs is as follows.

Theorem 4.3. For β̂_i in (2.3) and β̂ in (1.3), β̂ is the stable solution of (1.1). Assume that

‖(X^T E_i X)^− X^T E_i − (X^T X)^{−1} X^T‖ ≤ ϵ · c_i · ‖β‖,  i = 1, . . . , r,

where c_i is constant. Then

‖β̃ − β‖ / ‖β‖ ≤ ϵ · (p + ‖Y‖ · Σ_{i=1}^r c_i).

Similarly, the error bound has the following corollary.

Corollary 4.4. For β̂_i in (2.3) and β̂ in (1.3), β̂ is the stable solution of (1.1). Assume that

‖(X^T E_i X)^− X^T E_i − (X^T X)^{−1} X^T‖ ≤ ϵ · ‖β‖,  i = 1, . . . , r.

Then

‖β̃ − β‖ / ‖β‖ ≤ ϵ · (p + ‖Y‖).

Now, we obtain a theorem and a corollary for the error bound as follows.

Theorem 4.5. For β̂_i in (2.3) and β̂ in (1.3), β̂ is the stable solution of (1.1). If

‖Y − Xβ‖ ≤ ϵ · ‖X‖ · ‖β‖,

then

‖β̃ − β‖ / ‖β‖ ≤ ϵ[(1 + p)‖(X^T E_i X)^−‖ · ‖X‖² + p].

Corollary 4.6. For β̂_i in (2.3) and β̂ in (1.3), β̂ is the stable solution of (1.1). If

‖Y − Xβ‖ ≤ ϵ · ‖X‖ · ‖β‖,

then

‖β̃ − β‖ / ‖β‖ ≤ ϵ[(1 + p)‖(X^T E_i X)^−‖ · ‖X‖² + p + C_X · ‖X‖].

Table 1
Consistency check with n0 = 32, p = 8, r = 16.

n      ε_rn           Time cost (s)   Iterations
32     6.02 × 10^-3   0.02            215
64     1.47 × 10^-3   0.03            223
128    8.65 × 10^-4   0.06            229
256    4.38 × 10^-4   0.12            235
512    1.56 × 10^-4   0.25            238
1024   7.91 × 10^-5   0.53            243
2048   2.67 × 10^-5   1.12            249
4096   5.86 × 10^-6   2.25            257

5. Experimental studies on the PMLE in MLRMs

In this section, we validate the effectiveness of the PMLE method through experiments. The experiments were deployed on a 16-node Beowulf cluster, each node with dual 3 GHz Intel CPUs and 3 GB of memory. We ran R and the Rmpi package on Red Hat Linux 6.2 using MPICH2 and the SNOW implementation. We designed a series of consistency experiments to confirm that the PMLE method works correctly. We then examined the ratios of outliers in the MLRMs using our PMLE method. In addition, we tested the scalability of the method, which was then used to fit an actual large data set.

5.1. Consistency experiments of the PMLE in MLRMs

The theorems in Section 4 show that the PMLE is consistent for MLRMs. We let

ε_rn = ε_rn(X_1, . . . , X_r) = (1/r) Σ_{i=1}^r ‖β̂_i − β‖,

which is an average error bound. When r is fixed, a decrease in ε_rn for increasing n indicates that consistency holds, i.e., that the PMLE yields a consistent solution.
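A simplified serial stand-in for such a consistency experiment is sketched in R below (our own illustration, not a reproduction of Table 1; the paper's runs used Rmpi/SNOW on a cluster). For each sample size n it regenerates data from (1.1), computes the r subsample estimates, and reports the average error bound ε_rn; here the subsample size grows with n so that the decreasing error is visible.

```r
set.seed(5)
p <- 8; r <- 16
beta <- runif(p, -1, 1)

eps_rn <- function(n, n0) {
  X <- matrix(rnorm(n * p), n, p)
  Y <- X %*% beta + rnorm(n)
  errs <- replicate(r, {
    idx <- sample(n, n0)
    bi  <- solve(crossprod(X[idx, ]), crossprod(X[idx, ], Y[idx]))
    sqrt(sum((bi - beta)^2))
  })
  mean(errs)                                   # (1/r) * sum_i ||beta_hat_i - beta||
}

ns <- c(64, 128, 256, 512, 1024)
data.frame(n = ns, eps_rn = sapply(ns, function(n) eps_rn(n, n0 = n %/% 4)))
```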
The problem size is determined by the following dimensions: the sample size n, the subsample size n0, the number of parameters p, and the number of MPI processes r. Starting from an initial solution β^(0) = (β_1^(0), . . . , β_p^(0)), we ran the selected routine and obtained a PMLE; β̃ is obtained from the β̂_i (i = 1, . . . , r).
Experiment samples were generated and stored for each distinct setting of (n, n0, p) and for the maximum number of MPI processes r, ensuring that any two runs with the same dimensions (n, n0, p, r) used exactly the same sample. Thus, their results are directly comparable. All runs were distributed in the same manner across the cluster.
Table 1 demonstrates the consistency of the PMLE method, with n0, p, r fixed and n varying from a moderate to a large sample size. The value of ε_rn is displayed, along with the number of PMLE iterations and the time cost (in seconds). The time costs include reading the data, setting up the problem, and the initial communication. Clearly, ε_rn decreases as the sample size n increases. Therefore, the computed estimates appear consistent. Furthermore, the time cost roughly doubles as the sample size doubles, whereas the number of PMLE iterations increases slowly.
Table 2 shows results similar to those in Table 1, but with the remaining dimensions fixed and n0, p, or r varied individually. The quality of the solutions is shown as these dimensions vary. An increase in r causes the expected increase in time cost. In Table 2(a), ε_rn also decreases as the subsample size n0 increases. The time cost and the number of iterations increase for large n0. Table 2(b) shows a dramatic increase in computational time as p doubles. The number of PMLE iterations also increases, whereas ε_rn appears to be unaffected. Table 2(c) shows that changing r does not significantly affect the solutions. If n = 128 and the subsample size is also 128, then it does not make sense to use 16 MPI processes.

5.2. Outlier experiments for the PMLE in MLRMs

The general MLEs are slightly less precise in the presence of outliers (an outlier is an observation that is numerically distant from the rest of the data). Thus, studying the ratio of outliers is necessary in order to reduce their influence. In this study, the number of parameters p and the number of MPI processes r were fixed. If N_{n0} is the number of distinct outliers in the subsamples and N_n is the number of distinct outliers in the sample, the ratio of outliers is ρ_N = N_{n0}/N_n. Tables 3–5 provide the ratios of outliers for different outlier percentages with p = 10 and r = 8.
In Tables 3–5, the PMLE methods have low outlier ratios with large sample sizes and small subsample sizes, but compromise outlier detection with small sample sizes and large subsample sizes. In this case, the optimal relationship between n and n0 is provided.

Table 2
Solution quality varying n0, p, r.

(a) Vary n0, using n = 128, p = 32, r = 16
n0    ε_rn           Time cost (s)   Iterations
1     5.95 × 10^-1   0.02            129
2     4.87 × 10^-1   0.04            113
4     3.98 × 10^-1   0.07            109
8     2.76 × 10^-1   0.15            117
16    1.53 × 10^-1   0.32            121
32    9.83 × 10^-2   0.63            127
64    6.79 × 10^-2   1.19            136
128   3.68 × 10^-2   2.41            147

(b) Vary p, using n = 256, n0 = 64, r = 16
p     ε_rn           Time cost (s)   Iterations
4     1.21 × 10^-2   0.14            110
8     1.18 × 10^-2   0.27            124
16    1.25 × 10^-2   0.53            153
32    1.15 × 10^-2   1.07            169
64    1.22 × 10^-2   2.13            185

(c) Vary r, using n = 256, n0 = 64, p = 32
r     ε_rn           Time cost (s)   Iterations
1     5.58 × 10^-3   0.09            194
2     5.72 × 10^-3   0.15            193
4     6.08 × 10^-3   0.29            196
8     6.10 × 10^-3   0.54            195
16    5.79 × 10^-3   1.07            195
32    5.61 × 10^-3   2.11            196

Table 3
5% outliers of Y's in MLRMs (ρ_N).

n \ n0   5      10     20     40     80     160    320
20       0      1      1      –      –      –      –
40       0.50   0.50   1      1      –      –      –
80       0.50   0.50   0.75   0.75   1      –      –
160      0.25   0.38   0.50   0.75   0.87   1      –
320      0.13   0.23   0.31   0.45   0.73   0.87   1
640      0.06   0.13   0.23   0.35   0.46   0.75   0.97
1280     0.03   0.13   0.16   0.26   0.38   0.45   0.78
2560     0.02   0.06   0.13   0.18   0.24   0.37   0.47

Table 4
10% outliers of Y's in MLRMs (ρ_N).

n \ n0   5      10     20     40     80     160    320
20       1      1      1      –      –      –      –
40       0.50   0.50   1      1      –      –      –
80       0.50   0.50   0.75   0.75   1      –      –
160      0.25   0.38   0.50   0.75   0.87   1      –
320      0.25   0.31   0.45   0.50   0.75   0.94   1
640      0.13   0.23   0.34   0.46   0.52   0.78   0.97
1280     0.06   0.16   0.26   0.38   0.45   0.51   0.89
2560     0.03   0.07   0.15   0.24   0.37   0.47   0.52

Tables 3–5 show that the optimal relation bounds are n/n0 ∈ [8, 32], n/n0 ∈ [8, 64] and n/n0 ∈ [16, 64], respectively. From the behavior across n and n0, the PMLE methods are effective for outlier detection in the samples (an outlier detection technique can also be viewed as testing whether an instance is generated by the model or not).
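To clarify how ρ_N can be computed, the following R sketch (our own simplified illustration, assuming that N_{n0} counts the distinct outliers that land in at least one subsample) plants a known set of outliers in Y, draws the r subsamples, and reports the captured fraction.

```r
set.seed(6)
n <- 320; n0 <- 40; p <- 10; r <- 8; out_frac <- 0.05

X <- matrix(rnorm(n * p), n, p)
Y <- X %*% runif(p) + rnorm(n)
out_idx <- sample(n, ceiling(out_frac * n))      # indices of the planted outliers
Y[out_idx] <- Y[out_idx] + 20                    # shift them far from the bulk of the data

subsamples <- replicate(r, sample(n, n0), simplify = FALSE)
captured   <- unique(unlist(subsamples))         # rows seen by at least one processor

N_n   <- length(out_idx)                         # distinct outliers in the sample
N_n0  <- sum(out_idx %in% captured)              # distinct outliers in the subsamples
rho_N <- N_n0 / N_n
rho_N
```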

5.3. Scalability experiments for the PMLE in MLRMs

We consider the parallel performance of the PMLE when varying the number of MPI processes r. We examined the time cost and the 'efficiency' metric, which is conventionally reported in performance studies.

Table 5
15% outliers of Y's in MLRMs (ρ_N).

n \ n0   5      10     20     40     80     160    320
20       1      1      1      –      –      –      –
40       0.50   0.50   1      1      –      –      –
80       0.50   0.50   0.75   0.75   1      –      –
160      0.25   0.38   0.50   0.75   0.87   1      –
320      0.25   0.38   0.48   0.50   0.82   0.94   1
640      0.18   0.23   0.38   0.46   0.52   0.78   0.97
1280     0.09   0.19   0.29   0.38   0.48   0.54   0.81
2560     0.05   0.09   0.18   0.26   0.38   0.49   0.55

Table 6
Time cost and efficiency for the PMLE of MLRMs.

(a) Time cost (s)
Processing nodes (r)   1      2      4      8      16
Model 1 (p = 3)        5.74   3.08   1.66   1.01   0.64
Model 2 (p = 10)       9.28   4.91   2.75   1.61   0.98
Model 3 (p = 50)       46.2   23.9   12.2   6.18   3.12

(b) Observed efficiency (E_r)
Processing nodes (r)   1      2      4      8      16
Model 1 (p = 3)        1      0.93   0.77   0.71   0.54
Model 2 (p = 10)       1      0.95   0.75   0.72   0.59
Model 3 (p = 50)       1      0.966  0.98   0.987  0.99

Note: n = 1.6 × 10^6 and subsample size n0 = 50000.

Table 7
Multiple R-squared of the 'lm' function with chosen subsets in Bank32nh.

Subset   50:100   100:150   150:200   200:250   250:300   300:350   350:400
MR-sq    0.8221   0.8465    0.7848    0.7849    0.7265    0.8275    0.861

MR-sq denotes multiple R-squared.

Let c ∈ {n, n0, p} be an experiment variable under observation. Define T_r(c) as the time cost in seconds to compute a problem of size c using r processes. The speedup is defined as S_r(c) = T_1(c)/T_r(c), where S_r(c) close to r suggests ideal parallel performance. Efficiency is defined as E_r(c) = S_r(c)/r, where E_r(c) close to 1 suggests ideal performance. The same sample was used whenever c was constant and the number of processes r varied; this simplifies comparisons between different r. Parallel runs ensured that the results matched.
The samples for our first simulation were generated from the MLRM Y_1 = β_1 x_1 + β_2 x_2 + β_3 x_3. In the other simulations, the samples were generated from Y_2 = Σ_{l=1}^{10} β_l x_l and Y_3 = Σ_{l=1}^{50} β_l x_l. Here x_l ∼ N(µ_l, σ_l²), where the µ_l and σ_l² differ across l, and the β_l are unknown parameters.
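A rough, self-contained way to reproduce this kind of scalability measurement on a multicore machine (not the paper's Rmpi/SNOW cluster code) is sketched below. It generates data from the first model, times a batch of subsample fits with parallel::mclapply for varying numbers of workers, and reports the speedup S_r and efficiency E_r; since mclapply relies on forking, the sketch assumes a Unix-like system.

```r
library(parallel)

set.seed(7)
n <- 2e5; p <- 3; n0 <- 5000; r_sub <- 64        # r_sub subsample fits to distribute
beta <- c(0.5, -1, 2)
X <- matrix(rnorm(n * p), n, p)
Y <- X %*% beta + rnorm(n)

fit_one <- function(seed) {
  set.seed(seed)
  idx <- sample(n, n0)
  solve(crossprod(X[idx, ]), crossprod(X[idx, ], Y[idx]))
}

time_with <- function(cores) {
  system.time(mclapply(seq_len(r_sub), fit_one, mc.cores = cores))[["elapsed"]]
}

cores <- c(1, 2, 4, 8)
t_r   <- sapply(cores, time_with)
data.frame(cores, time = t_r,
           speedup    = t_r[1] / t_r,             # S_r = T_1 / T_r
           efficiency = (t_r[1] / t_r) / cores)   # E_r = S_r / r
```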
Table 6 reports the execution time using a sample size of n = 1.6 × 10^6. On one hand, with subsample size n0 = 50000 we obtained a speed-up that is approximately linear in the number of available nodes. On the other hand, an important factor is the rank of X, namely p: the execution time increases when p increases. In addition, increasing the number of nodes r decreases the time cost for large data sets.
Table 6 shows the results of the simulations with varying r. For each fixed n, increasing the number of processes r strongly affects the time cost. Particularly good results are obtained for p = 50.
In Table 6(a), for p = 3, 10 and 50, doubling r almost halves the time cost. As r increases, the run time decreases significantly. Thus, our method significantly reduces the time cost for large enough dimension sizes, especially p = 50.
Table 6(b) shows the efficiency when r varies and p = 3, 10 and 50. The time cost is not halved indefinitely as r doubles. The optimal values are r = 4 for p = 3, r = 8 for p = 10 and r = 16 for p = 50.

5.4. Actual data experiments for the PMLE in MLRMs

We examined whether the advantages of the proposed method remain valid on actual data sets. For this purpose, we use a sample of bankruptcy data from [9]. In this data set, Bank32nh includes 4500 observed samples, with 31 continuous attributes and two-dimensional output values (mxql and rej). The MLE was obtained by fitting Bank32nh using the function lm in R. For mxql, the multiple R-squared was 0.4156 and the F-statistic was 102.5 on 31 and 4468 degrees of freedom. Subsets of Bank32nh were then selected to examine the parallel maximum likelihood. Let r = 7 and rank(E_i) = n0 = 51 in Eq. (2.3), where the matrices R_i are fixed. If we choose a value of r larger than 7, for example r = 16, the time cost behaves in the same way as in Table 6. Table 7 shows a series of multiple R-squared values of these subset maximum likelihoods and an estimator of mxql.
mxql = 4.61749 − 0.07253a1cx + 0.2806a1cy − 0.2821a1sx + 0.1796a1sy + 0.15413a1rho
− 0.04706a1pop + 0.02655a2cx − 0.05268a2cy − 0.05536a2sx + 0.1238a2sy
+ 0.3376a2rho − 0.03203a2pop + 0.20188a3cx + 0.2771a3cy − 0.05074a3sx
− 0.1831a3sy + 0.38922a3rho + 0.07905a3pop − 0.3564temp + 0.05736b1x
+ 0.2942b1y + 0.25223b1call − 0.24556b1eff − 0.1871b2x + 0.034925b2y + 0.282143b2call
− 0.18141b2eff − 0.25519b3x − 0.23129b3y + 0.312347b3call + 0.054851b3eff .
We obtained the above PMLE of mxql for the 31 attributes, which is a weighted least-squares combination of these subset estimators with weights 1/7. The statistical properties of our estimator are the same as those of the WLS estimator. At the same time, Table 7 indicates that each subset estimator's multiple R-squared is larger than the multiple R-squared value of the MLE.
A suitable subset can be found if we have a PMLE of mxql for the 4500 observed samples. In particular, we can choose r = 1, rank(E_i) = 51 and the subset 350:400. The multiple R-squared value of our method is then 0.861, which is much larger than that of the full-sample fit. It is seen that the effectiveness of the method is related to the chosen subset.
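The subset-fitting step can be illustrated with the following R sketch (the file name and subset choices are placeholders; Bank32nh itself is not bundled with R). It fits lm on r = 7 subsets of n0 = 51 rows, records each subset's multiple R-squared, and averages the coefficient vectors with equal weights 1/7.

```r
# Hypothetical loading step: 'bank' is assumed to be a data frame containing the
# response 'mxql' and the 31 continuous attributes as its remaining columns.
bank <- read.csv("bank32nh.csv")

r <- 7; n0 <- 51
set.seed(8)
starts  <- sample(seq_len(nrow(bank) - n0), r)             # placeholder subset choices
subsets <- lapply(starts, function(s) s:(s + n0 - 1))

fits  <- lapply(subsets, function(idx) lm(mxql ~ ., data = bank[idx, ]))
r_sq  <- sapply(fits, function(f) summary(f)$r.squared)    # per-subset multiple R-squared
coefs <- sapply(fits, coef)
pmle  <- rowMeans(coefs)                                   # equal weights 1/r

r_sq
pmle["(Intercept)"]
```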

6. Discussion and future research

This paper presented the PMLE method for MLRMs, along with several of its properties and its computational efficiency. Effective estimation and short run-times for large data sets were achieved by using the method to fit these models.
The PMLE method is effective in computing the MLRMs. The effects of adjusting the sample size, the subsample size, and the number of parameters were studied through simulations. Increasing the sample size verified the consistency of the PMLE method. An increase in the number of parameters quickly increased the run-time. Increasing the subsample size similarly increased the run-time and improved the quality of our method. Varying the sample and subsample sizes showed excellent performance. Despite the increased run-time, a high number of parameters allowed for near-optimal parallel performance. Nevertheless, the PMLE method with an extremely high number of parameters would be infeasible to maximize, even on a large cluster.
In addition, outlier experiments were presented to obtain the optimal relationship between the sample and subsample sizes. Finally, the PMLE method was used in a more realistic analysis, yielding a significant improvement in performance using only a few computation nodes. The method can be applied to statistical computations in general; in particular, we computed the PMLE for the MLRMs.
For future research, we will study further asymptotic properties of the PMLE method, including a working matrix, tolerance regions, the choice of block length in the method, the Wishart matrix case, computational cost, and so on. The best subset can be selected using certain rules, such as AIC and BIC, which are important aspects of future study.

For further reading

[10], [11], [12], [13], [14], [15], [16] and [17].

Acknowledgments

We would like to thank Prof. M. J. Goovaerts for his useful help and suggestions. This work was supported by the NSFC under grants 10921101, 91130003, 11171189 and 11326183, the China Postdoctoral Science Foundation (135569), the NBSC under grant 2012LY017, and the Shandong Natural Science Foundation (ZR2011AZ002).

Appendix. Technical proofs

In this section, we collect some of the technical proofs.


Proof of Proposition 2.1. Observe that rank( Erj ) = n, l < r and Eq. (3.2). Let
l
j =1
  −1
l

Σ0 = Erj ,
j =1

then
    −1  
r l l
1  
β̃ = (X Ei X ) X Ei Y = X
T − T T
Erj X X T
Erj Y = (X T Σ0−1 X )−1 X T Σ0−1 Y .
r i=1 j=1 j =1

Thus we get the equivalence proposition. The above method is a WLS estimator since Σ0 is a diagonal matrix. 

Proof of Theorem 3.1. For {P_i = X(X^T E_i X)^− X^T E_i, i = 1, 2, . . . , r}, note that

rank(X(X^T E_i X)^− X^T E_i) = rank\begin{pmatrix} X(X^T E_i X)^− X^T E_i & X \\ 0 & E_i X \end{pmatrix} − rank(E_i X)
 = rank\begin{pmatrix} 0 & X \\ E_i X(X^T E_i X)^− X^T E_i & 0 \end{pmatrix} − rank(E_i X)
 = rank(X) + rank(E_i X(X^T E_i X)^− X^T E_i) − rank(E_i X).

We then have

max_i{rank(P_i)} = max_i{rank(X(X^T E_i X)^− X^T E_i)} = rank(X),

and

min_i{rank(P_i)} = min_i{rank(X(X^T E_i X)^− X^T E_i)} = min_i{rank(E_i X(X^T E_i X)^− X^T E_i)}.

It is easy to see that

\begin{pmatrix} I & 0 \\ 0 & X^T \end{pmatrix} \begin{pmatrix} 0 \\ E_i X(X^T E_i X)^− X^T E_i \end{pmatrix} = \begin{pmatrix} 0 \\ X^T E_i X(X^T E_i X)^− X^T E_i \end{pmatrix} = \begin{pmatrix} 0 \\ X^T E_i \end{pmatrix},

and

\begin{pmatrix} I & 0 \\ 0 & E_i X(X^T E_i X)^− \end{pmatrix} \begin{pmatrix} 0 \\ X^T E_i \end{pmatrix} = \begin{pmatrix} 0 \\ E_i X(X^T E_i X)^− X^T E_i \end{pmatrix}.

These imply that

rank\begin{pmatrix} 0 \\ E_i X(X^T E_i X)^− X^T E_i \end{pmatrix} = rank\begin{pmatrix} 0 \\ X^T E_i \end{pmatrix}.

It follows readily that

min_i{rank(P_i)} = min_i rank(E_i X),
min_i{I_p − rank(P_i)} = p − max_i{rank(P_i)} = p − rank(X). □

Proof of Theorem 3.2. Write

Λ = \begin{pmatrix} ν_1² & \\ & Λ_1 \end{pmatrix},  Λ_1 = \begin{pmatrix} ν_2² & & \\ & \ddots & \\ & & ν_p² \end{pmatrix}.

We derive the bounds by projecting X_i^T X_i onto a 2 × 2 matrix with eigenvalues ν_1² and ν_2².
We start with the positive semi-definite case for X_i^T X_i. If z is an eigenvector such that

(X^T X)z = λ_min(X^T X)z,  ‖z‖ = 1,  \begin{pmatrix} z_1 \\ z_{2:p} \end{pmatrix} = \begin{pmatrix} p_1^T z \\ P̃_1^T z \end{pmatrix} = P̃^T z,

then

λ_min(X^T X) = z^T(X^T X)z = z_{2:p}^T Λ_1 z_{2:p} + ν_1² |z_1|² + |y^T z|²
 ≥ ν_2² ‖z_{2:p}‖² + ν_1² |z_1|² + |z_{2:p}^T y_{2:p} + z_1 y_1|²
 = \begin{pmatrix} z_1 \\ z_{2:p} \end{pmatrix}^T \left[ \begin{pmatrix} ν_1² & 0 \\ 0 & ν_2² I_{p−1} \end{pmatrix} + \begin{pmatrix} y_1 \\ y_{2:p} \end{pmatrix} \begin{pmatrix} y_1 \\ y_{2:p} \end{pmatrix}^T \right] \begin{pmatrix} z_1 \\ z_{2:p} \end{pmatrix}.

Let Q be a matrix of order p − 1 such that Qy_{2:p} = ‖y_{2:p}‖ e_{p−1} and set w = \begin{pmatrix} z_1 \\ Qz_{2:p} \end{pmatrix}, where ‖w‖ = 1. Then

λ_min(X^T X) ≥ w^T \left[ \begin{pmatrix} ν_1² & 0 \\ 0 & ν_2² I_{p−1} \end{pmatrix} + \begin{pmatrix} y_1 \\ Qy_{2:p} \end{pmatrix} \begin{pmatrix} y_1 \\ Qy_{2:p} \end{pmatrix}^T \right] w ≥ λ_min \begin{pmatrix} L_+ & 0 \\ 0 & ν_2² I_{p−2} \end{pmatrix} = min{ν_2², λ_min(L_+)}.

Applying (3.1) to L_+ gives ν_1² ≤ λ_min(L_+) ≤ ν_2², and

λ_min(X^T X) ≥ min{ν_2², λ_min(L_+)} = λ_min(L_+).

Now consider the negative semi-definite case for X_i^T X_i, and let z be an eigenvector such that

(X_i^T X_i − V^T V)z = λ_min(X_i^T X_i − V^T V)z,  ‖z‖ = 1.

As above, one shows λ_min(X_i^T X_i − V^T V) ≥ min{ν_2², λ_min(L_−)}. Applying (3.2) to L_− gives ν_1² − ‖y‖² ≤ λ_min(L_−) ≤ ν_1², and

λ_min(X_i^T X_i − V^T V) ≥ min{ν_2², λ_min(L_−)} = λ_min(L_−).

Since U_± are the respective trailing 2 × 2 principal submatrices of P̃^T(X_i^T X_i ± V^T V)P̃, applying (3.1) to U_+ and (3.2) to U_− gives λ_min(U_−) ≤ ν_2² and λ_min(U_+) ≤ ν_1². This proves the theorem. □
Proof of Lemma 4.1. Suppose there is a z ∈ C_0 ∩ C_1, so that z = Xw for some w, ‖z‖ = 1 and X^T E_i z = 0. Then w^T X^T E_i z = 0, i.e. z^T E_i z = 0. This proves the lemma. □
Proof of Theorem 4.2. For fixed i, call

w = Q_i β_i.

Let z = Xw; the goal is to obtain an upper bound on z in terms of β_i. Note that

X^T E_i(β_i − z) = 0.

Recall that v = β_i − z; by Lemma 4.1, X^T E_i v = 0. Set t = 1/‖z‖; we then have

t v + t z = t β_i.

Note that tv is in C_1 and −tz is in C_0. The norm of the left-hand side of this equation is at least ρ. Thus ‖tβ_i‖ ≥ ρ, i.e.

‖z‖ ≤ ‖β_i‖ / ρ.

Observe that

‖P_i‖ ≤ ‖X‖ · ‖Q_i‖,
‖Q_i‖ = ‖(X^T X)^{−1} X^T X(X^T E_i X)^− X^T E_i‖ ≤ ‖(X^T X)^{−1} X^T‖ · ‖X(X^T E_i X)^− X^T E_i‖.

This completes the proof of the theorem. □

Proof of Theorem 4.3. According to Wilkinson [18], β̂ is the stable solution, so ‖β̂ − β‖ ≤ ϵ · p · ‖β‖. We have

‖β̂_i − β̂‖ = ‖(X^T E_i X)^− X^T E_i Y − (X^T X)^{−1} X^T Y‖
 ≤ ‖(X^T E_i X)^− X^T E_i − (X^T X)^{−1} X^T‖ · ‖Y‖
 ≤ ϵ · ‖β‖ · c_i · ‖Y‖,  i = 1, . . . , r.

Moreover, since ‖β̂_i − β‖ ≤ ‖β̂_i − β̂‖ + ‖β̂ − β‖, we have

‖β̃ − β‖ = ‖(1/r) Σ_{i=1}^r (β̂_i − β)‖ ≤ (1/r) Σ_{i=1}^r ‖β̂_i − β‖
 ≤ (1/r) Σ_{i=1}^r (‖β̂_i − β̂‖ + ‖β̂ − β‖)
 ≤ ϵ · ‖β‖ · (‖Y‖ · Σ_{i=1}^r c_i + p).

Thus

‖β̃ − β‖ / ‖β‖ ≤ ϵ · (p + ‖Y‖ · Σ_{i=1}^r c_i),

as claimed. □

Proof of Theorem 4.5. Note that

X^T E_i X(β̂ − β̂_i) = X^T(I − E_i)(Y − X β̂).

Now we have

β̂ − β̂_i = (X^T E_i X)^− X^T(I − E_i)(Y − X β̂).

Hence,

‖β̂ − β̂_i‖ = ‖(X^T E_i X)^− X^T(I − E_i)(Y − X β̂)‖
 ≤ ‖(X^T E_i X)^−‖ · ‖X^T‖ · ‖I − E_i‖ · ‖Y − X β̂‖
 = ‖(X^T E_i X)^−‖ · ‖X^T‖ · ‖Y − X β̂‖
 ≤ ‖(X^T E_i X)^−‖ · ‖X‖ · (‖Y − Xβ‖ + ‖X‖ · ‖β − β̂‖)
 ≤ ‖(X^T E_i X)^−‖ · ‖X‖ · (ϵ · ‖X‖ · ‖β‖ + ‖X‖ · ϵ p · ‖β‖)
 ≤ ϵ · ‖β‖ · ‖(X^T E_i X)^−‖ · ‖X‖² · (1 + p).

Observe that

‖β̃ − β‖ = ‖(1/r) Σ_{i=1}^r (β̂_i − β)‖ ≤ (1/r) Σ_{i=1}^r ‖β̂_i − β‖
 ≤ (1/r) Σ_{i=1}^r (‖β̂_i − β̂‖ + ‖β̂ − β‖)
 ≤ ϵ · ‖β‖ · ‖(X^T E_i X)^−‖ · ‖X‖² · (1 + p) + ϵ p · ‖β‖
 = ϵ · ‖β‖ · [(1 + p)‖(X^T E_i X)^−‖ · ‖X‖² + p].

We then have

‖β̃ − β‖ / ‖β‖ ≤ ϵ[(1 + p)‖(X^T E_i X)^−‖ · ‖X‖² + p].

This proves the theorem. □

Proof of Corollary 4.6. Observe that

‖(X^T E_i X)^− X^T(I − E_i)(Y − X β̂)‖ ≤ ‖(X^T E_i X)^− X^T(Y − X β̂)‖ + ‖(X^T E_i X)^− X^T E_i(Y − X β̂)‖.

By Theorems 4.2 and 4.5, note that

‖β̂ − β̂_i‖ = ‖(X^T E_i X)^− X^T(I − E_i)(Y − X β̂)‖
 ≤ ‖(X^T E_i X)^−‖ · ‖X^T‖ · ‖Y − X β̂‖ + ‖(X^T E_i X)^− X^T E_i‖ · ‖Y − X β̂‖
 ≤ ϵ · ‖β‖ · [(1 + p)‖(X^T E_i X)^−‖ · ‖X‖² + p] + ϵ C_X · ‖X‖ · ‖β‖.

Therefore we have the corollary. □

References

[1] B. Eksioglu, R. Demirer, I. Capar, Subset selection in multiple linear regression: a new mathematical programming approach, Comput. Ind. Eng. 49
(2005) 155–167.
[2] T.J. Mitchell, J.J. Beauchamp, Bayesian variable selection in linear regression, J. Amer. Statist. Assoc. 83 (1988) 1023–1032.
[3] T. Havránek, Z. Stratkoš, On practical experience with parallel processing of linear models, Bull. Int. Statist. Inst. 53 (1989) 105–117.
[4] M. Xu, E. Wegman, J. Miller, Parallelizing multiple linear regression for speed and redundancy: an empirical study, J. Stat. Comput. Simul. 39 (1991)
205–214.
[5] J. Skvoretz, S. Smith, C. Baldwin, Parallel processing applications for data analysis in the social sciences, Concurrency, Pract. Exp. 4 (1992) 207–221.
[6] R. Bouyouli, K. Jbilou, R. Sadaka, H. Sadok, Convergence properties of some block Krylov subspace methods for multiple linear systems, J. Comput.
Appl. Math. 196 (2006) 498–511.
[7] E. Wegman, On Some Statistical Methods for Parallel Computation, in: Statistics Textbooks and Monographs, vol. 184, 2006, pp. 285–306.
[8] G. Guo, Parallel statistical computing for statistical inference, J. Stat. Theory Pract. 6 (2012) 536–565.
[9] D.P. Foster, R.A. Stine, Variable selection in data mining: building a predictive model for bankruptcy, J. Amer. Statist. Assoc. 99 (2004) 303–313.

[10] R. Coppi, P. D’Urso, P. Giordani, A. Santoro, Least squares estimation of a linear regression model with LR fuzzy response, Comput. Statist. Data Anal.
51 (2006) 267–286.
[11] G. Guo, Schwarz methods for quasi-likelihood in generalized linear models, Comm. Statist. Simulation Comput. 37 (2008) 2027–2036.
[12] G. Guo, S. Lin, Schwarz method for penalized quasi-likelihood in generalized additive models, Comm. Statist. Theory Methods 39 (2010) 1847–1854.
[13] G. Guo, W. Zhao, Schwarz methods for quasi stationary distributions of Markov chains, Calcolo 49 (2012) 21–39.
[14] M. Schervish, Applications of parallel computation to statistical inference, J. Amer. Statist. Assoc. 83 (1988) 976–983.
[15] A. Stamatakis, RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models, Bioinformatics 22 (2006)
2688–2690.
[16] S.J. Steel, D.W. Uys, Influential data cases when the Cp criterion is used for variable selection in multiple linear regression, Comput. Statist. Data Anal.
50 (2006) 1840–1854.
[17] Y. Tian, Y. Takane, Some properties of projectors associated with the WLSE under a general linear model, J. Multivariate Anal. 99 (2008) 1070–1082.
[18] J.H. Wilkinson, The Algebraic Eigenvalue Problem (Vol. 87), Clarendon Press, Oxford, 1965.
