
Multivariate Behavioral Research

ISSN: (Print) (Online) Journal homepage: https://www.tandfonline.com/loi/hmbr20

Large-Scale Survey Data Analysis with Penalized Regression: A Monte Carlo Simulation on Missing Categorical Predictors

Jin Eun Yoo & Minjeong Rho

To cite this article: Jin Eun Yoo & Minjeong Rho (2021): Large-Scale Survey Data Analysis with Penalized Regression: A Monte Carlo Simulation on Missing Categorical Predictors, Multivariate Behavioral Research, DOI: 10.1080/00273171.2021.1891856

To link to this article: https://doi.org/10.1080/00273171.2021.1891856

Published online: 11 Mar 2021.


Large-Scale Survey Data Analysis with Penalized Regression: A Monte Carlo Simulation on Missing Categorical Predictors

Jin Eun Yoo and Minjeong Rho
Department of Education, Korea National University of Education

ABSTRACT
With the advent of the big data era, machine learning methods have evolved and proliferated. This study focused on penalized regression, a procedure that builds interpretive prediction models among machine learning methods. In particular, penalized regression coupled with large-scale data can explore hundreds or thousands of variables in one statistical model without convergence problems and identify yet uninvestigated important predictors. As one of the first Monte Carlo simulation studies to investigate predictive modeling with missing categorical predictors in the context of social science research, this study endeavored to emulate real social science large-scale data. Likert-scaled variables were simulated as well as multiple-category and count variables. Due to the inclusion of the categorical predictors in modeling, penalized regression methods that consider the grouping effect were employed such as group Mnet. We also examined the applicability of the simulation conditions with a real large-scale dataset that the simulation study referenced. Particularly, the study presented selection counts of variables after multiple iterations of modeling in order to consider the bias resulting from data-splitting in model validation. Selection counts turned out to be a necessary tool when variable selection is of research interest. Efforts to utilize large-scale data to the fullest appear to offer a valid approach to mitigate the effect of nonignorable missingness. Overall, penalized regression which assumes linearity is a viable method to analyze social science large-scale survey data.

KEYWORDS
Penalized regression; machine learning; large-scale data; survey data; missing data

Introduction

In the pronounced era of "big data," researchers have more access to data of various structures or large volumes than ever, and an increasing number of social scientists have been paying attention to novel techniques and analytics such as machine learning (ML1).

Strengths of ML include the following. First, ML is well-known for its versatility in data analysis. Traditional statistical methods are optimal for analyzing structured, long data (i.e., more observations than variables) (Bzdok et al., 2018) stored in conventional spreadsheet format. On the other hand, social scientists are often confronted with wide data (i.e., more variables than observations) (Bzdok et al., 2018), for instance when listwise deletion is employed in large-scale data analysis. To adequately analyze the wide data with traditional methods, employing techniques such as data aggregation (e.g., variable collapse via averaging) or feature reduction (e.g., principal components analysis) is necessary before main analysis, which results in information loss and/or difficult-to-interpret models. Even beyond wide data sets, ML can handle unstructured data such as image, text, sound, or sensor data, and has been proposed as the "official" tool for analyzing big data (Alsheikh et al., 2014; Boiy & Moens, 2009; Zhang & Ma, 2012).

Second, inferential models from traditional statistical methods neglect prediction in favor of explanation (Shmueli, 2010). Even when the goal is prediction, models based on hypothesis testing are generally explanatory. Inferential models explain the current data well, but their utility with new data is limited. Considering generalization as the "desideratum" in social science research (Campbell & Stanley, 1963, p. 5, as cited in Shadish et al., 2002), we should not neglect predictive models as they lead to generalization. From the perspective of bias-variance tradeoff in statistical learning, traditional inferential models are more likely to lead to overfitting and

CONTACT Minjeong Rho [email protected] Department of Education, Korea National University of Education, Cheong-ju, Republic of Korea.
1 Following the convention of machine learning literature, ML indicates machine learning throughout this manuscript. To avoid confusion, maximum likelihood estimation is abbreviated as MLE.
© 2021 Taylor & Francis Group, LLC

thus more prediction errors than predictive models (Hastie et al., 2009). When overfitting occurs, data eccentricities are reflected in model fitting, often resulting in unnecessarily complex models with unsatisfactory predictive capability. By contrast, ML focuses on building predictive models with holdout data techniques or cross-validation, which improves predictability and generalizability of the resulting models (Shmueli, 2010).

Third, traditional research tends to be bounded by established theories and/or existing literature. In traditional research, confirmatory methods (i.e., confirmatory factor analysis, structural equation modeling) have been preferred over exploratory ones (i.e., exploratory factor analysis, stepwise regression), and data-driven methods are often criticized for lacking theoretical underpinnings. On the other hand, ML is exploratory in nature. In particular, penalized regression can make possible methodological breakthroughs from a different perspective; models from penalized regression can provide novel insights to researchers by revealing important variables heretofore uninvestigated (Yoo, 2018; Yoo & Rho, 2020).

To summarize, ML methods are optimized for big data analysis with its versatility, but are also helpful to create predictive models and answer exploratory questions. In other words, ML serves as a reasonable alternative to traditional significance testing approaches, particularly when the data to analyze are wide and/or the purpose of the research is to yield predictive models via data exploration. Specifically, we propose that ML be applied to traditional large-scale survey data. Even in the big data era, survey methods are still one of the most popular tools to collect data, but survey data have been mostly analyzed with traditional statistical methods. Among the options in ML, penalized regression was chosen as the focal method to examine in the context of large-scale survey data analysis for the following reasons.

First, penalized regression has an underlying assumption of sparsity (Hastie et al., 2015). Sparsity assumes that the true model consists of a much smaller subset of large number of variables, and thus large-scale survey data are more likely to meet the sparsity assumption. Second, penalized regression can examine hundreds or thousands of survey variables in one statistical model, avoiding nonconvergence and overfitting. As a result, penalized regression selects a subset of variables, some of which may reveal new, important relationships not yet investigated in previous research. Third, penalized regression is a linear model, typically consisting of main-order effects. The regression coefficients of penalized regression can be interpreted in the similar way as in unpenalized or traditional regression, thus yielding models that are interpretable (Yoo & Rho, 2020). By contrast, other popular ML methods such as deep learning, support vector machines, and random forest consist of complex higher-order and/or interaction effects that make interpretation of the resulting models difficult or impossible.

Lastly, missing data deserves attention. Missing data is a pervasive problem regardless of discipline, and particularly the approach of this study (including hundreds of predictors from large-scale data in one model) can exacerbate the problem. For instance, the combination of listwise deletion and large-scale data can lead to reduced sample or even wide data. The results from reduced sample may be invalidated due to low power, and traditional analysis methods have a difficulty in handling wide data. Nevertheless, not much is studied regarding missing data techniques with large number of incomplete predictors. The main goal of this study was to investigate penalized regression with different missing data techniques in the context of social science large-scale survey data analysis.

Overview of penalized regression and missing data techniques

This section reviews penalized regression and missing data techniques selected for examination in this study. The subsection on penalized regression highlights the three methods: group LASSO, group Enet (group elastic net), and group Mnet. The subsection on missing data techniques explains LD (listwise deletion), k-NN (k-nearest neighbors), and EM (expectation-maximization) imputations.

Penalized regression

Penalized regression imposes a penalty on the objective function and diminishes some of the coefficient estimates. The goal of penalized regression is to lower variances and ultimately to reduce mean squared errors by introducing a slight bias in the estimates. The earliest penalized regression method, ridge, was designed to handle multicollinearity problems with ordinary least squares. Ridge adds a penalty to the diagonal elements of a singular X^T X matrix to make the matrix invertible (Hoerl & Kennard, 1970), but does not perform variable selection.

Developed by Tibshirani (1996), least absolute selection and shrinkage operator (LASSO) is the first

and one of the most popular penalized regression methods thus far for selecting important variables. LASSO also handles high-dimensional data, but its estimates are known to be inconsistent in terms of variable selection (Fan & Li, 2001; Huang et al., 2008; Meinshausen & Bühlmann, 2006; Zhao & Yu, 2006; Zou, 2006). Among a variety of methods proposed to solve the inconsistency issue of LASSO, MCP (minimax concave penalty) is known to yield nearly consistent coefficient estimates (Huang et al., 2016; Zhang, 2010). While LASSO uses a convex penalty that increases linearly regardless of the coefficient size, MCP uses a concave penalty that tapers off with larger coefficients in absolute values.

Another consideration with LASSO is that it cannot properly handle collinear data (Zou & Hastie, 2005). When hundreds or thousands of variables are considered in one statistical model, multicollinearity can be a real problem. Incorporating ridge in the penalty function solves this problem. Enet and Mnet are combinations of LASSO and ridge (Zou & Hastie, 2005) and MCP and ridge (Huang et al., 2016), respectively. Both Enet and Mnet retain the useful features of LASSO and MCP and handle multicollinearity due to the inclusion of ridge in the penalty function.

More specifically, all the methods of this study incorporate the grouping effect resulting from variables of multiple-response categories. In a regression model, a categorical variable should be treated as a group after dummy coding. As a subgroup of regression methods, penalized regression also needs to treat dummy-coded variables in the same way. Group LASSO, group Enet, and group Mnet were considered the appropriate methods to handle both categorical and continuous variables of large-scale survey data; these methods are referred to as LASSO, Enet, and Mnet for brevity.

LASSO, Enet, and Mnet can be represented by equations. Consider a linear regression model with p predictors. Suppose the predictors are divided into K non-overlapping groups and the model is written as the following Equation (1).

$$Y = \sum_{k=1}^{K} X_k \beta_k + \epsilon \quad (1)$$

Y is an n-dimensional vector of a response variable. X_k is the n × p_k design matrix of the p_k predictors in the kth group. β_k = (β_{k,1}, ..., β_{k,p_k})^T is the p_k-dimensional vector of regression coefficients of the kth group. ε is an n-dimensional vector of mean zero.

The objective functions of LASSO, Enet, and Mnet for a Gaussian family are shown in Equations (2)–(4), respectively. The same first term in the right-hand side of the equations is the loss function of least squares. LASSO uses the L1 penalty, the sum of absolute values of coefficients, as well as least squares (Equation (2)). The tuning parameter λ regularizes shrinkage of the coefficients. On top of LASSO, Enet has another tuning parameter, α, which controls the amount of ridge (Equation (3)). When α is 1, Equation (3) reverts to the LASSO equation, and when α is 0, it is the same as ridge. Mnet has three tuning parameters: λ1, γ, and λ2 (Equation (4)). The tuning parameter λ1 is a regularization parameter that controls the amount of penalty. The γ parameter, the concavity penalty, regulates the penalization rate depending on the size of the coefficients. While the penalty rate of LASSO is kept constant for all ranges of the coefficients, the penalty rate of Mnet quickly drops with increasing coefficients in absolute values, thereby applying less shrinkage to the coefficients and yielding less-biased estimates than LASSO (Huang et al., 2016). Generally, a value of 2 is assigned to γ (Zhang, 2010). On top of MCP, Mnet adds a ridge penalty to the equation in order to handle multicollinearity, and the penalty parameter for ridge is λ2. The ridge penalties of Enet and Mnet were set at 0.5 (Hastie & Qian, 2016).

$$\hat{\beta}_{L} = \underset{\beta}{\operatorname{argmin}} \left\| Y - \sum_{k=1}^{K} X_k \beta_k \right\|_2^2 + \lambda \sum_{k=1}^{K} \|\beta_k\| \quad (2)$$

$$\hat{\beta}_{Enet} = \underset{\beta}{\operatorname{argmin}} \left\| Y - \sum_{k=1}^{K} X_k \beta_k \right\|_2^2 + \lambda \sum_{k=1}^{K} \left( \alpha \|\beta_k\| + (1-\alpha)\|\beta_k\|^2 \right) \quad (3)$$

$$\hat{\beta}_{Mnet} = \underset{\beta}{\operatorname{argmin}} \left\| Y - \sum_{k=1}^{K} X_k \beta_k \right\|_2^2 + \sum_{k=1}^{K} J(\|\beta_k\| \mid \lambda_1, \gamma) + \lambda_2 \sum_{k=1}^{K} \|\beta_k\|^2, \quad (4)$$

$$\text{where } J(x \mid \lambda_1, \gamma) = \begin{cases} \lambda_1 |x| - \dfrac{x^2}{2\gamma}, & |x| \le \gamma\lambda_1 \\ \dfrac{1}{2}\gamma\lambda_1^2, & |x| > \gamma\lambda_1 \end{cases}$$

Missing data techniques

LD (listwise deletion)
If any of the variable values for a row is missing, LD removes the entire observation row, only retaining complete cases without missing values. Depending on the missingness patterns, LD may lead to seriously

low power and invalidation of the analysis (Schafer & Graham, 2002). LD typically requires MCAR (missing completely at random), the strongest assumption, the violation of which can result in biased estimates (Allison, 2001). Nevertheless, LD is simple, easy to use, and does not require special statistical skills. Thus, LD has been the default for handling missing data in most statistical software programs. In particular, when predictors have missing values and the missingness mechanism is nonignorable, LD may outperform MLE (maximum likelihood estimation), yielding less biased estimates (Allison, 2014). Therefore, LD served as the baseline for comparisons of missing data techniques.

K-NN (nearest neighbors)
With its intuitively appealing concept and easy-to-understand algorithms, k-NN (k-nearest neighbors) has been in common use for regression and classification in predictive modeling (Hastie et al., 2009; Huang et al., 2017; Sahri & Yusof, 2014; Song et al., 2008; Yoo & Rho, 2020). When the value of k is set, the closest observations or nearest neighbors to a target data point are identified with distance calculation in the multidimensional space. Categorical and continuous data use majority-voting and averaging of the k closest observations, respectively.

The same logic and notion of k-NN regression and classification are applied to missing data imputation. This time k-NN imputes categories or values of missing points with majority-voting or averaging of the k nearest neighbors obtained by distance calculation. Although calculating distances of k-NN can be computationally demanding, particularly with high-dimensional data (Troyanskaya et al., 2001), k-NN as a hot-deck imputation method offers the advantage of imputing real values from the data (Beretta & Santaniello, 2016). Compared to cold-deck imputation in which missing values are replaced with responses from past surveys, hot deck imputation replaces missing values with responses from the current survey, and thus the imputed values are realistic and within the possible range (Myers, 2011). The tuning parameter of k-NN is the value of k, and what distance measure to use depends on the data to be analyzed. In our study, the square root of the number of complete cases served as k, following Beretta and Santaniello (2016). Gower distance was selected for distance calculation because of its versatility with mixed format data (Gower, 1971).

EM imputation
EM (expectation and maximization) is considered a state of the art technique for handling missing data (Schafer & Graham, 2002). EM comprises two steps, expectation and maximization, and was named after the two steps (Dempster et al., 1977). An expectation step estimates log-likelihood expectation of missing values with observed data, and the resulting estimation fills in missing values. The following maximization step maximizes the conditional log-likelihood expectations of the filled-in data, updating the parameters in the preceding expectation step. These two steps are repeated until convergence is reached, and a single value fills in each missing data point.

Both k-NN and EM are single-imputation methods, but they differ in the assumptions on population distribution. Unlike the nonparametric k-NN which does not require an explicit modeling for the distribution of the missing values, EM is parametric. When the MAR or ignorability assumption is satisfied, EM is known to produce consistent, asymptotically normal, and efficient parameter estimates under suitable regularity conditions (Allison, 2003; Schafer & Graham, 2002). For the imputation models of EM, this study included the main effects of all the possible predictors. Specifically, the R package Amelia II (Honaker et al., 2011) was used, and the first set of imputation after convergence replaced missing values. The multivariate normality assumption of Amelia was not satisfied in the simulation due to the inclusion of categorical predictors.

Literature reviews

This section consists of four subsections: ML in survey data analysis; penalized regression in social science research; penalized regression in simulation studies; and research questions of the study.

ML in survey data analysis
Despite the big data era, in which new techniques for data collection thrive, survey remains one of the main tools to collect data in social science research. Scientists of the new era strive to overcome problems with surveys (e.g., low response rates and rising costs) by employing ML (Hill et al., 2019). In particular, ML has been frequently used in data collection and estimation such as creating sampling strata/sampling units (Buskirk et al., 2018; Eck et al., 2018) or augmenting estimation of traditionally collected survey data (Chew et al., 2018).

While survey scientists utilized ML in data collection and estimation within the traditional framework, some applied ML to traditional survey data from the perspective of ML. These studies aimed at constructing prediction models from survey data and comparing various ML methods. For instance, in a study analyzing survey data from 13 items answered by 152 high school students, the predictive ability of multilayer perceptron was more accurate than that of decision trees in predicting slow learners (Kaur et al., 2015). Mikovic et al. (2019) analyzed online survey data consisting of 59 questions from 215 organizations and found that artificial neural networks outperformed logistic regression in terms of classifying knowledge levels.

Penalized regression in social science research
Among ML methods, penalized regression has been a popular method to execute regularization and prevent overfitting in fields such as bioinformatics and engineering. Researchers in psychology and education have recently begun to pay attention to penalized regression such as LASSO (Finch & Finch, 2016; McNeish, 2015). Particularly in psychometrics, LASSO has been applied to the detection of differential item functioning (Magis et al., 2015) and item-trait pattern recognition in tests based on multidimensional item response theory (Sun et al., 2016; Sun & Ye, 2019). Studies by Yoo (2018) and Yoo and Rho (2020) advanced this line of research a step further by applying penalized regression to large-scale survey data. Penalized regression performs variable selection under the sparsity assumption (Hastie et al., 2015), which large-scale survey data are likely to satisfy.

Specifically, Yoo (2018) used Enet in exploring important student and teacher TIMSS (trends in international mathematics and science study) predictors for students' mathematics achievement and identified 17 such predictors out of 162 variables. Likewise, Yoo and Rho (2020) employed group Mnet to predict TALIS (teaching and learning international survey) teacher job satisfaction and identified 18 important predictors out of 558 variables. However, previous penalized regression studies in social science have not paid enough attention to missing data. McNeish (2015) and Finch and Finch (2016) introduced LASSO with small data sets and did not mention issues relating to missing data. In Yoo's study, only 55% of the original sample was retained after employing listwise deletion, and Yoo and Rho (2020) employed k-NN (nearest neighbor) to handle missing data. Not much is known regarding the performance of penalized regression with missing data in the context of social science large-scale data analysis.

Penalized regression in simulation studies
Penalized regression has been one of the most popular ML methods in the fields of computer science/engineering (Fan et al., 2009; Shen et al., 2016; Zhou et al., 2015), bioinformatics/medicine (Ma & Huang, 2008; Waldmann et al., 2013; Wang et al., 2017), and finance (Kim & Swanson, 2014; Pereira et al., 2016; Smeekes & Wijler, 2018; Uematsu & Tanaka, 2019). As missing data are a common problem in data analysis regardless of discipline, a body of simulation studies on penalized regression with missing data has accumulated in the fields of computer science/engineering (Nguyen & Tran, 2012) and bioinformatics/medicine (Chen & Wang, 2013; Sabbe et al., 2013; Wu et al., 2009). However, previous simulation studies in these areas do not properly fit the characteristics of social science large-scale data for the following reasons.

First, large p, small n (more variables than observations) data were simulated, but social science large-scale data consist of more observations than variables, designed to be subject to traditional statistical methods. Second, previous studies simulated Poisson, gamma, or even semi-parametric distributions, while social science large-scale data commonly comprise Likert-scaled and categorical variables. Third and relatedly, because categorical variables in penalized regression need to be dummy-coded, penalized regression methods that consider the grouping effect need to be employed such as group LASSO and group Enet. Fourth, previous simulation studies addressed MCAR (missing completely at random) (Chen & Wang, 2013; Gao & Lee, 2017; Liu et al., 2016; Sabbe et al., 2013; Yu et al., 2013) and sometimes MAR (missing at random) (Johnson et al., 2008; Liu et al., 2016; Sabbe et al., 2013; Zahid & Heumann, 2019). It appears important to evaluate methods for nonignorable missingness in social science research (Peng et al., 2006).

Research questions of the study
Our literature review reveals the importance of studying predictive modeling with missing data via penalized regression in simulations reflecting the characteristics of social science large-scale data.

Table 1. Distributions and parameters for data generation (n = 2,000).

Variable name | Distribution | Parameters | Categories | Number of variables
B1–B10 | X ~ Ber(p) | p = 0.5 | 0 or 1 | 10
B11–B20 | X ~ Ber(p) | p = 0.2 | 0 or 1 | 10
B21–B30 | X ~ Ber(p) | p = 0.05 | 0 or 1 | 10
P1–P5 | X ~ Poi(λ) | λ = 3 | NA | 5
M1–M5 | X ~ Mult(p) | p = 0.1, 0.2, 0.3, 0.4 | 1–4 | 5
L1–L260 | X ~ N_p(μ, Σ) | μ_j = 2.6–3.2 (j = 1, ..., p); Σ_ij = 0.05–0.70 (i ≠ j); Σ_ij = 0.6–0.95 (i = j) | 1–5 | 260
Total | | | | 300
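For illustration, the following minimal R sketch draws variables along the lines of Table 1. It is a simplified sketch rather than the study's actual simstudy-based data generation: only one 10-item parcel with an assumed exchangeable correlation block is shown, and MASS::mvrnorm() stands in for the multivariate normal draws; the specific correlation values are assumptions within the ranges reported in Table 1.

```r
# Minimal sketch of Table 1-style data generation (simplified; the study used simstudy)
library(MASS)  # for mvrnorm()

set.seed(1)
n <- 2000

# Binary variables: Bernoulli with p = .5, .2, .05 (10 variables each)
B <- cbind(
  matrix(rbinom(n * 10, 1, 0.50), n, 10),
  matrix(rbinom(n * 10, 1, 0.20), n, 10),
  matrix(rbinom(n * 10, 1, 0.05), n, 10)
)
colnames(B) <- paste0("B", 1:30)

# Count variables: Poisson with lambda = 3 (5 variables)
P <- matrix(rpois(n * 5, lambda = 3), n, 5,
            dimnames = list(NULL, paste0("P", 1:5)))

# Multiple-category variables: 4 categories with probabilities .1, .2, .3, .4
M <- matrix(sample(1:4, n * 5, replace = TRUE, prob = c(.1, .2, .3, .4)),
            n, 5, dimnames = list(NULL, paste0("M", 1:5)))

# Likert-scaled item parcel: multivariate normal, then discretized to a 1-5 scale.
# One 10-item parcel is shown; the study generated 26 parcels (260 items).
p     <- 10
Sigma <- matrix(0.5, p, p); diag(Sigma) <- 0.8    # assumed exchangeable block
L <- mvrnorm(n, mu = rep(2.9, p), Sigma = Sigma)
L <- pmin(pmax(round(L), 1), 5)                   # cut to the 1-5 Likert range
colnames(L) <- paste0("L", 1:p)

sim_dat <- data.frame(B, P, M, L)
str(sim_dat[, 1:5])
```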

Specifically, this study selected k-NN and EM imputation algorithms, representing the nonparametric and parametric missing data techniques, respectively. Listwise deletion (LD) served as the baseline for comparison. For penalized regression, methods dealing with categorical variables were chosen. Specifically, the performance of group LASSO (least absolute shrinkage and selection operator), group Enet, and group Mnet were compared under varying conditions with regard to model prediction and variable selection. LASSO is one of the most popular penalized regression methods, and Enet and Mnet handle multicollinearity, using convex and concave penalty functions, respectively. The main research questions of the simulation study were as follows:

1. Which penalized regression method, group LASSO, group Enet, or group Mnet, performs best in terms of model prediction and variable selection?
2. Which missing data technique, k-NN or EM, shows better recovery when compared to LD, when used with MAR or MNAR?

To summarize, the main purpose of the research was to compare the performance of penalized regression methods under various conditions including missing data techniques and missingness mechanisms in a Monte Carlo simulation. The simulation study was Study I. To illustrate the applicability of the Study I findings, we also conducted Study II with a real social science large-scale dataset that Study I referenced.

Study I: Method

Data generation

Distribution of variables
Our Monte Carlo simulation study emulated real social science large-scale data. The actual large-scale survey we referenced included about 300 variables on about 2,000 participants, including measures that were Likert-scaled (80%), categorical (8%), and count (12%) variables. Particularly, the Likert-scaled variables were enclosed in tens of item parcels, which are commonly observed in social science large-scale data. With regard to missing rates, approximately 80% of the 300 variables were completely observed, and the remainder had missing rates per variable ranging from less than 1% to 50%.

Reflecting these characteristics, data of 2,000 participants were generated using Bernoulli and multinomial distributions for categorical variables, a Poisson distribution for count variables, and multivariate normal distributions for Likert-scaled variables. Table 1 lists the variable names, distributions, parameters, number of categories, and numbers of variables generated for the simulation. All the distribution parameters were carefully chosen to conform to real large-scale data. For instance, a total of 26 sets of item parcels were drawn from multivariate normal distributions, and the Cronbach alphas of the item parcels ranged between .68 and .95 after data generation. In particular, a two-stage cluster sampling was employed in data generation. After 50 out of 200 clusters (25%) were selected, 40 individuals were randomly sampled from each selected cluster with a probability of 50%. Following the real data that we referenced, individual-level variables were generated.

Response variable and the true model
Response variables in social science are frequently binary in nature such as pass/fail or absence/presence of a certain psychological trait, and therefore logistic regression has been frequently used to analyze these binary response variables. Of note, binary response variables in social science data are likely to be unbalanced. For instance, dropout rates for online learning ranged between 25% to 44% (Willging & Johnson, 2009), and the ratio of students' career decision-making was approximately 4:1 (Germeijs & Verschueren, 2007). To investigate the effect of unbalanced categories on prediction and variable selection, the response

Table 2. Variables and coefficients of the true model.

Variable name: L1, L2, L3, L35, L36, L39, L101, L102, L103, L151, L161
Coefficient:   .3, .3, .3, –.3, –.3, –.3, –.5, –.5, –.5, .2, –.4

Variable name: L171, L181, L205, L206, L207, B5, B10, B30, M5.1, M5.2, M5.3
Coefficient:   .3, –.6, .5, .5, .5, –.3, .3, .3, .2, .2, .2

variable of this study was generated as 0 or 1 with the ratio of 4:1, using a logistic regression model in Equation (5). The true model's variable names and coefficients are presented in Table 2.

$$y_i \overset{i.i.d.}{\sim} \mathrm{Ber}(\pi_{x_i}), \quad \pi_{x_i} = \Pr(y_i = 1 \mid x_i) = \frac{\exp(x_i^T \beta)}{1 + \exp(x_i^T \beta)} \quad (5)$$

Missing data generation
Missing data were generated with the missingness mechanisms of either MAR or MNAR. Out of a total of 300 predictors, 4 Likert-scaled variables (L10, L20, L30, L40) were designed to have no missing values and to be the cause of missingness. When the model included the 4 complete variables, the missingness mechanism was said to satisfy MAR; otherwise, it was MNAR. Each cause of missingness variable exclusively resulted in missingness of 20 randomly selected variables; no other variables influenced the missingness of the 20 variables. Thus, the 4 variables of cause of missingness resulted in 80 randomly selected variables. Specifically, 60 variables of them had 1% missing per variable, and the remaining 20 variables had a missing rate of 10%. In an effort to reflect real data's missingness patterns, the last 5 clusters among the 50 selected clusters had 20% of the data points randomly deleted. On average, approximately 2.8% of the 600,000 (= 2,000 cases multiplied by 300 variables) data points were missing across iterations (SD = 2.4%). After applying LD, only 6.4% of the 2,000 cases remained (mean = 128, SD = 9.07), which resulted in wide data.

Monte Carlo simulation
This study used 18 Monte Carlo simulation combinations: 2 missingness mechanisms (MAR, MNAR), 3 missing data techniques (LD, k-NN, and EM), and 3 penalized regression methods (group LASSO, group Enet, and group Mnet). A total of 10,000 datasets were fitted per condition combination. The steps were as follows. First, data were generated with the true model of logistic regression (refer to Tables 1 and 2). Second, the generated data were deleted depending on the missingness mechanisms (MAR and MNAR). Third, three missing data techniques (LD, k-NN, and EM) were applied to the deleted dataset of Step 2. Fourth, the imputed dataset from Step 3 was randomly split into training and test data with a ratio of 7:3. In particular, the 4:1 ratio of the binary response variable was kept within each dataset by employing stratified sampling. Prediction models were built with three penalized regression methods (group LASSO, group Enet, and group Mnet), applying 10-fold CV (cross-validation) to the training data. Of note, this step was repeated 100 times with random data-splitting. Fifth and last, using the test data from Step 3, the models built in Step 4 were evaluated in terms of evaluation criteria explained in the next section. The five steps were iterated 100 times. This iteration number of 100 multiplied by the 100 repetitions in Step 4 resulted in 10,000 models per condition combination. All the programs were written in R (refer to Appendix). Specifically, simstudy (Goldfeld, 2020) and missMethods (Rockel, 2020) were used for data generation and deletion, VIM (Kowarik & Templ, 2016) and Amelia (Honaker et al., 2011) for missing data imputation, and finally grpreg (Breheny & Zeng, 2020) for penalized regression.

Evaluation criteria

Missing data imputation
The performance of two missing data techniques, k-NN and EM, were evaluated using two types of agreement rates. First, exact agreement rates were calculated for all explanatory variables. Second, for Likert, multinomial, and count variables, adjacent agreement rates were also obtained (Gisev et al., 2013; Shweta et al., 2015). For instance, when the true value of a Likert-scaled variable was 2, the exact agreement rates only counted the value of 2 as a correct imputation. The adjacent agreement rates, on the other hand, considered values 1, 2, and 3 as correct imputation. Therefore, adjacent agreement rates were always higher than exact agreement rates for Likert, multinomial, and count variables.

Model evaluation: Prediction measures
Prediction accuracy, AUC (Area Under an ROC Curve), and kappa were utilized as model evaluation criteria. Prediction accuracy was calculated as an average of correctly predicted observations. AUC functions as the summary of all sensitivity and specificity results. Specifically, an AUC of 0.5 indicates the model is as good as a random guess, and an AUC of 1.0 represents a perfect model. Lastly, kappa measures binary variables' classification accuracy compared to

Table 3. Monte Carlo simulation results: imputation with k-NN and EM.

Missingness mechanism | MAR | MNAR
Missing technique | k-NN | EM | k-NN | EM
Exact agreement rate | .51(.14) | .36(.13) | .51(.14) | .36(.13)
Adjacent agreement rate | .93(.11) | .79(.10) | .95(.12) | .82(.14)
Note. Numbers in parentheses indicate the standard deviations.

Table 4. Monte Carlo simulation results: prediction measures.

Missing mech. | Methods | Missing technique | Accuracy Mean (SD) | AUC Mean (SD) | Kappa Mean (SD)
Complete data | Group LASSO | NA | .97(.01) | .95(.02) | .91(.03)
Complete data | Group Enet | NA | .97(.01) | .94(.02) | .89(.03)
Complete data | Group Mnet | NA | .98(.01) | .96(.02) | .92(.03)
MAR | Group LASSO | LD | .82(.06) | .64(.11) | .38(.20)
MAR | Group LASSO | k-NN | .96(.01) | .93(.02) | .87(.03)
MAR | Group LASSO | EM | .95(.01) | .92(.02) | .85(.03)
MAR | Group Enet | LD | .82(.06) | .64(.11) | .39(.20)
MAR | Group Enet | k-NN | .95(.01) | .92(.02) | .86(.03)
MAR | Group Enet | EM | .95(.01) | .91(.02) | .84(.04)
MAR | Group Mnet | LD | .80(.07) | .61(.11) | .32(.21)
MAR | Group Mnet | k-NN | .96(.01) | .94(.02) | .89(.03)
MAR | Group Mnet | EM | .96(.01) | .93(.02) | .87(.04)
MNAR | Group LASSO | LD | .82(.06) | .64(.11) | .38(.20)
MNAR | Group LASSO | k-NN | .96(.01) | .93(.02) | .87(.03)
MNAR | Group LASSO | EM | .95(.01) | .91(.02) | .85(.04)
MNAR | Group Enet | LD | .82(.06) | .64(.11) | .39(.20)
MNAR | Group Enet | k-NN | .95(.01) | .92(.02) | .86(.03)
MNAR | Group Enet | EM | .95(.01) | .91(.02) | .84(.04)
MNAR | Group Mnet | LD | .80(.07) | .61(.11) | .32(.21)
MNAR | Group Mnet | k-NN | .96(.01) | .94(.02) | .89(.03)
MNAR | Group Mnet | EM | .96(.01) | .92(.02) | .87(.04)

random guessing. A kappa value of 1.0 indicates a perfect fit. According to Landis and Koch (1977), values between .21 and .40 indicate fair agreement, values between .41 and .60 moderate agreement, values between .61 and .80 substantial agreement, and values above .80 represent almost perfect agreement.

Model evaluation: Variable selection
The selection or non-selection of each variable in 10,000 iterations served as the selection counts of the study, and this study presented the numbers of selected variables in 1 or more, 2,500 (25%) or more, 5,000 (50%) or more, 7,500 (75%) or more, and all 10,000 (100%) iterations in each condition combination. Particularly for the variables selected 50% or more, 75% or more, and 100% of the time, IC1 and IC2 were calculated to evaluate variable selection performance (Zhang et al., 2014). IC1 counts the number of unselected true variables, while IC2 counts the number of wrongfully selected variables. Therefore, models of the smaller IC1 and IC2 indicate better performance in terms of variable selection.

Study I: Results

Missing data imputation
In order to examine the performance of imputation techniques, the results of k-NN and EM were compared in terms of exact and adjacent agreement rates. Table 3 summarizes the means and standard deviations of agreement rates after 10,000 iterations. Overall, missingness mechanisms did not appear to have an influence on either exact or adjacent agreement rates, but k-NN outperformed EM with regard to imputation. Both exact and adjacent agreement rates of k-NN were higher than those of EM. The exact agreement rates of k-NN were .51, while those of EM were .36 regardless of missingness mechanism. The adjacent agreement rates nearly doubled but showed a similar pattern. Specifically, those of k-NN were .93 and .95, while those of EM were .79 and .82 for MAR and MNAR, respectively. Although the adjacent agreement rates of MNAR were 2 to 3%p higher than those of MAR, the standard deviations of MNAR were also larger than those of MAR; their confidence intervals almost entirely overlapped.

Model evaluation: Prediction measures
The evaluation criteria for prediction were accuracy, AUC, and kappa. Table 4 presents means and standard deviations of each prediction measure after 10,000 iterations. Complete data results are presented for comparison. All three penalized regression methods yielded a mean accuracy of about .97, indicating all the penalized regression models correctly classified 97% of new observations on average. Means of AUC ranged between .94 and .96, which were well above the 0.5 of random guessing. Means of kappa were between .89 and .92, indicating almost perfect agreement (Landis & Koch, 1977).

After deletion and imputation, k-NN and EM outperformed LD. The mean accuracy values of k-NN and EM were almost the same as those of complete data, ranging between .95 and .96, but those of LD were lower, ranging between .80 and .82. AUC showed similar patterns. The mean AUC values of k-NN and EM ranged between .92 and .94 and between .91 and .93, respectively, and those of LD ranged between .61 and .64. While the mean kappa values of k-NN and EM were between .86 and .89 and between .84 and .87, respectively, kappa coupled with LD deteriorated; the means ranged from .32 to .39 and the standard deviations were larger than those of k-NN and EM. The standard deviations of kappa were also larger than those for accuracy and AUC.

Table 5. Monte Carlo simulation results: selection counts.

Missing mech. | Methods | Missing technique | ≥1 | ≥2,500 | ≥5,000 | ≥7,500 | =10,000 | G
Complete data | Group LASSO | NA | 300 | 299 | 24 | 20 | 17 | .95
Complete data | Group Enet | NA | 300 | 300 | 24 | 20 | 17 | .96
Complete data | Group Mnet | NA | 300 | 19 | 19 | 19 | 9 | .18
MAR | Group LASSO | LD | 300 | 13 | 6 | 0 | 0 | .01
MAR | Group LASSO | k-NN | 300 | 278 | 24 | 20 | 13 | .85
MAR | Group LASSO | EM | 300 | 271 | 20 | 20 | 12 | .82
MAR | Group Enet | LD | 300 | 16 | 7 | 0 | 0 | .02
MAR | Group Enet | k-NN | 300 | 300 | 24 | 20 | 13 | .89
MAR | Group Enet | EM | 300 | 300 | 30 | 20 | 12 | .86
MAR | Group Mnet | LD | 297 | 4 | 0 | 0 | 0 | .00
MAR | Group Mnet | k-NN | 300 | 19 | 19 | 18 | 6 | .05
MAR | Group Mnet | EM | 300 | 19 | 18 | 18 | 3 | .02
MNAR | Group LASSO | LD | 296 | 13 | 6 | 0 | 0 | .01
MNAR | Group LASSO | k-NN | 296 | 273 | 24 | 20 | 15 | .86
MNAR | Group LASSO | EM | 296 | 255 | 21 | 20 | 13 | .81
MNAR | Group Enet | LD | 296 | 16 | 7 | 0 | 0 | .02
MNAR | Group Enet | k-NN | 296 | 296 | 24 | 20 | 16 | .89
MNAR | Group Enet | EM | 296 | 294 | 31 | 20 | 13 | .84
MNAR | Group Mnet | LD | 294 | 4 | 0 | 0 | 0 | .00
MNAR | Group Mnet | k-NN | 296 | 19 | 19 | 18 | 6 | .04
MNAR | Group Mnet | EM | 296 | 19 | 18 | 18 | 4 | .00
Note. The last column, G, indicates the ratio of successfully detecting the multiple-category variable in the 10,000 iterations.

Table 6. Monte Carlo simulation results: IC1 and IC2.

Missing mech. | Methods | Missing technique | IC1 (≥5,000) | IC2 (≥5,000) | IC1 (≥7,500) | IC2 (≥7,500) | IC1 (=10,000) | IC2 (=10,000)
Complete data | Group LASSO | NA | 0 | 4 | 0 | 0 | 3 | 0
Complete data | Group Enet | NA | 0 | 4 | 0 | 0 | 3 | 0
Complete data | Group Mnet | NA | 1 | 0 | 1 | 0 | 11 | 0
MAR | Group LASSO | LD | 14 | 0 | 20 | 0 | 20 | 0
MAR | Group LASSO | k-NN | 0 | 4 | 0 | 0 | 7 | 0
MAR | Group LASSO | EM | 0 | 0 | 0 | 0 | 8 | 0
MAR | Group Enet | LD | 13 | 0 | 20 | 0 | 20 | 0
MAR | Group Enet | k-NN | 0 | 4 | 0 | 0 | 7 | 0
MAR | Group Enet | EM | 0 | 10 | 0 | 0 | 8 | 0
MAR | Group Mnet | LD | 20 | 0 | 20 | 0 | 20 | 0
MAR | Group Mnet | k-NN | 1 | 0 | 2 | 0 | 14 | 0
MAR | Group Mnet | EM | 2 | 0 | 2 | 0 | 17 | 0
MNAR | Group LASSO | LD | 14 | 0 | 20 | 0 | 20 | 0
MNAR | Group LASSO | k-NN | 0 | 4 | 0 | 0 | 5 | 0
MNAR | Group LASSO | EM | 0 | 1 | 0 | 0 | 7 | 0
MNAR | Group Enet | LD | 13 | 0 | 20 | 0 | 20 | 0
MNAR | Group Enet | k-NN | 0 | 4 | 0 | 0 | 4 | 0
MNAR | Group Enet | EM | 0 | 11 | 0 | 0 | 7 | 0
MNAR | Group Mnet | LD | 20 | 0 | 20 | 0 | 20 | 0
MNAR | Group Mnet | k-NN | 1 | 0 | 2 | 0 | 14 | 0
MNAR | Group Mnet | EM | 2 | 0 | 2 | 0 | 16 | 0
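For reference, the selection counts behind Table 5 and the IC1/IC2 indices in Table 6 can be computed from the iteration-by-variable selection results along the following lines. This is a base-R sketch with hypothetical objects — `selected`, a 10,000 x 300 logical matrix of selection indicators with variable names as column names, and `true_vars`, the names of the 20 true predictors — not the authors' code.

```r
# Sketch: selection counts (Table 5) and IC1/IC2 (Table 6) from iteration results.
# 'selected': hypothetical 10000 x 300 logical matrix (TRUE = variable selected in
# that iteration), with colnames = variable names; 'true_vars': the 20 true predictors.

selection_counts <- colSums(selected)          # how often each variable was selected

# Numbers of variables selected in at least 1, 2,500, 5,000, 7,500, and 10,000 iterations
thresholds <- c(1, 2500, 5000, 7500, 10000)
sapply(thresholds, function(k) sum(selection_counts >= k))

# IC1 = true variables not selected; IC2 = false variables selected,
# for a given selection-count criterion (e.g., 75% of the 10,000 iterations)
ic <- function(counts, true_vars, k) {
  kept <- names(counts)[counts >= k]
  c(IC1 = sum(!(true_vars %in% kept)),
    IC2 = sum(!(kept %in% true_vars)))
}
ic(selection_counts, true_vars, k = 7500)
```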
To summarize, both k-NN and EM were superior to LD, and k-NN tended to slightly outperform EM in terms of prediction. Their results appeared to be comparable to those of complete data regardless of missingness mechanisms and penalized regression methods. When the other conditions were the same, penalized regression methods and missingness mechanisms did not appear to have an effect on accuracy, AUC, or kappa of the prediction models. Among prediction measures, kappa appears the most sensitive. In particular, kappa coupled with LD produced the worst results.

Model evaluation: Variable selection
Table 5 summarizes the results of selection counts after 10,000 iterations. Complete data results are also presented for reference. The first row of Table 5 indicates that all the 300 variables were selected at least once in the 10,000 iterations when complete data were analyzed with LASSO. This number dropped to 299, 24, 20, and 17 for 2,500 (25%) or more, 5,000 (50%) or more, 7,500 (75%) or more, and finally all 10,000 (100%) selections in the 10,000 iterations, respectively. In particular, a multiple-category variable (M5) was included in the true model (refer to Table 2), and the ratio that this variable was selected in the iterations is presented in the last column, G. The value of .95 indicates that the given condition correctly included the multiple-category variable in the prediction model 95% of the time.

Overall, missingness mechanisms did not have any interpretable effects on selection counts but penalized regression methods produced an apparent pattern. Consistent with previous literature (Huang et al., 2016), Mnet selected much fewer variables than LASSO and Enet regardless of missingness mechanisms and missing data techniques. Relatedly, Mnet selected the multiple-category variable at best 5% of the time. On the other hand, LASSO and Enet successfully detected this variable more than 80% of the time, when either k-NN or EM was employed. Specifically, the ratio was between 85% and 89% for k-NN and between 81% and 86% for EM. Among missing data techniques, k-NN and EM selected more variables than LD. LD also selected variables in an inconsistent way. No variables were selected in 75% or more iterations with LD, while k-NN and EM selected 20 variables with LASSO or Enet and 18 variables with Mnet. Enet tended to select more variables than LASSO, but this difference evened out after applying selection counts of 75% or more.

While Table 5 presents both true and false variables selected in the models, Table 6 separates false selection and true non-selection after applying 50%, 75%, and 100% selections; IC1 and IC2 represent the numbers of true variables unselected and false variables selected, respectively (Zhang et al., 2014). Of note, IC1 of 50% or more selections should be always equal to or smaller than that of all 100% selections, since

IC1 indicates the number of true variables unselected. IC2 has exactly the opposite characteristics; IC2 of 50% or more selections should be equal to or larger than that of 100% selections. When complete data were analyzed with 50% selection, LASSO and Enet selected not only all true variables but also 4 false variables; Mnet missed 1 true variable and selected no false variables. When 100% selection counts were applied, none of the penalized regression methods selected false variables, but LASSO and Enet missed 3 true variables and Mnet missed 11 true variables. The 75% selection counts turned out to be the best. Both LASSO and Enet selected all true variables and no false variables. Likewise, Mnet missed only 1 true variable and selected no false variables.

The following results were obtained after deletion and imputation. Among missing data techniques, LD performed the worst. LD missed 13 to 20 true variables with 50% selection counts and all true variables after 75% or more selection counts. This was consistent with Table 5 results showing that neither true nor false variables remained with LD after applying 75% selections or more. When combined with k-NN or EM, LASSO and Enet with 50% selection counts resulted in not only all true variables but also 1 to 11 false variables; Mnet selected no false variables but missed 1 to 2 true variables in one out of two runs. After applying 100% selections, no false variables were selected regardless of penalized regression methods, but LASSO, Enet, and Mnet missed 5 to 8, 4 to 8, and 14 to 17 true variables, respectively. On the other hand, the results of 75% selections were remarkable, when combined with either k-NN or EM. In particular, the IC1 and IC2 values of LASSO and Enet were all 0; Mnet selected no false variables and missed only 2 true variables.

To summarize, both IC1 and IC2 results shown in Table 6 shared the overall trends with the selection count results of Table 5. Mnet selected much fewer variables than LASSO or Enet, and thus tended to miss more true variables. Among missing data techniques, k-NN and EM outperformed LD. Missingness mechanisms did not appear to have any apparent effect. Of note was the finding that 75% selections outperformed 50% or 100% selections. After applying 75% selections, LASSO and Enet combined with k-NN or EM perfectly selected all the true variables out of 300 candidates.

Study II: Empirical data analysis

Sample
To examine the possible applicability of penalized regression and the missing data techniques with real large-scale data, we conducted Study II. The sample was from a social science large-scale dataset, the main topics of which included adolescents' cognitive/physical development, socialization skills, delinquency, and career path. A total of 1,658 students and their parents responded to 476 variables. Categorical variables were dummy-coded. In particular, six variables had more than 2 categories, which was the essential motivation of using group LASSO, group Enet, and group Mnet. The set of dummy variables from a categorical variable was treated as a group in variable selection.

After data cleaning, the final dataset consisted of 310 variables of 1,658 adolescents. The response variable was life satisfaction. Approximately 76.3% (1,265) of the adolescents were generally satisfied with their life, and 23.7% (393) of them were not. The other 309 variables served as explanatory variables of the study. Specifically, there were 234 variables (75%) measured by Likert-scales, 17 variables of binary categories, 6 variables of more than 2 categories, and 44 variables of count data (e.g., hours, frequencies). Out of 309 variables, approximately 78% of them were completely observed, and 10% and 5% had missing rates less than 1% and between 1% and 10%, respectively. The remaining 6% and 1% of the variables had missing rates ranging between 10% and 30% and between 30% and 50%, respectively.

Methods

Random forest
As aforementioned, one of the main advantages of penalized regression includes interpretable prediction models, as penalized regression assumes linearity. However, unlike simulation, we cannot be sure of the distributional characteristics of empirical data, and the true model might be nonlinear. Therefore, we added random forest (RF) to the set of analysis methods for comparison purposes. As a famous nonlinear ML method, RF is known to produce difficult-to-interpret but predictive models (Barboza et al., 2017; Chen et al., 2017; Idris et al., 2012; Leiß et al., 2012; Polishchuk et al., 2009). Having decision trees as the base learner, RF consists of complex interaction effects. The number of trees, the minimum size of terminal nodes, and the number of variables randomly sampled at each split were 500, 1, and 17, respectively, following the default for classification problems (Breiman, 2001; Breiman et al., 2018).

Table 7. Prediction measures of Study II.

Methods | Missing technique | Accuracy Mean (SD) | AUC Mean (SD) | Kappa Mean (SD)
Group LASSO | LD | .79(.04) | .62(.07) | .29(.15)
Group LASSO | k-NN | .85(.01) | .74(.02) | .54(.04)
Group LASSO | EM | .85(.01) | .74(.02) | .54(.04)
Group Enet | LD | .79(.04) | .62(.09) | .30(.18)
Group Enet | k-NN | .85(.01) | .74(.02) | .53(.04)
Group Enet | EM | .85(.01) | .74(.02) | .54(.04)
Group Mnet | LD | .76(.04) | .57(.06) | .21(.13)
Group Mnet | k-NN | .84(.01) | .74(.02) | .52(.04)
Group Mnet | EM | .84(.01) | .74(.02) | .53(.04)
Random forest | LD | .78(.03) | .59(.06) | .22(.13)
Random forest | k-NN | .84(.01) | .68(.02) | .45(.04)
Random forest | EM | .83(.01) | .68(.02) | .44(.04)

Table 8. Selection counts of Study II.

Methods | Missing technique | ≥1 | ≥25 | ≥50 | ≥75 | =100
Group LASSO | LD | 160 | 19 | 10 | 4 | 0
Group LASSO | k-NN | 251 | 89 | 44 | 25 | 9
Group LASSO | EM | 250 | 89 | 44 | 26 | 11
Group Enet | LD | 200 | 23 | 12 | 5 | 0
Group Enet | k-NN | 263 | 92 | 47 | 26 | 10
Group Enet | EM | 275 | 95 | 47 | 26 | 10
Group Mnet | LD | 52 | 8 | 1 | 0 | 0
Group Mnet | k-NN | 89 | 14 | 7 | 7 | 2
Group Mnet | EM | 94 | 16 | 9 | 7 | 2

Cross-validation and variable selection
After LD, k-NN, and EM were separately applied to the dataset, the steps for data analysis were as follows. First, each imputed dataset was randomly split into training and test data with a ratio of 7:3. Second, the three penalized regression methods (group LASSO, group Enet, and group Mnet) and RF were applied to training data to build prediction models. In particular, a 10-fold CV (cross-validation) was executed in every iteration of each penalized regression method. Third, using the test data from Step 1, the models built in Step 2 were evaluated in terms of prediction measures. The three steps were repeated 100 times after random data-splitting. This is the same as the shuffled k-fold CV (cross-validation) in the deep learning literature (e.g., Herent et al., 2019; Inoue et al., 2019).

Notably, Study II executed subsampling techniques for variable selection in penalized regression in order to consider the bias resulting from data-splitting in model validation (Meinshausen & Bühlmann, 2010; Shevade & Keerthi, 2003). The selection or non-selection of each variable in the 100 iterations served as the selection counts, and Study II presented the numbers of selected variables in 1 or more, 25 or more, 50 or more, 75 or more, and all 100 iterations in each condition combination of penalized regression.

Results
Table 7 presents means and standard deviations of each prediction measure after 100 iterations. The prediction measures of penalized regression tended to show better performance than those of RF, a well-known method for its prediction capabilities. Specifically, when combined with either k-NN or EM, means of AUC were all .74 for penalized regression and .68 for RF. Likewise, means of kappa ranged between .52 and .54 with penalized regression and between .44 and .45 with RF. The other results are consistent with those of Study I. Which penalized regression method to employ did not have an effect on accuracy, AUC, and kappa. For instance, means of accuracy ranged between .83 and .85 regardless of analysis methods. Among missing data techniques, both k-NN and EM were superior to LD regardless of analysis methods, which was most apparent with kappa. The standard deviations of kappa were also larger than those of accuracy and AUC. The difference between k-NN and EM was minimal throughout condition combinations.

As RF does not select variables, we present the selection counts of penalized regression after 100 iterations (Table 8). The results are also generally consistent with Study I. Mnet selected much fewer variables than LASSO and Enet. Enet tended to select more variables than LASSO, but this difference evened out after applying selection counts of 50% or more. Among missing data techniques, k-NN and EM selected more variables than LD, but no obvious pattern was found between k-NN and EM. LD selected variables in an inconsistent way. No variables were selected in all the 100 iterations with LD. By contrast, 9 to 11 variables were selected in every iteration in the condition of k-NN or EM coupled with LASSO or Enet, and there were 2 such variables in the condition of k-NN or EM coupled with Mnet.

Discussion

Our study proposed the use of penalized regression and investigated its performance on incomplete large-scale data in a Monte Carlo simulation. Specifically, penalized regression methods that consider the grouping effect were employed such as group LASSO, as predictors of multiple categories were included in modeling. Implications of the study are as follows.

First, selection counts turned out to be a necessary tool for implementing penalized regression when variable selection is of research interest. Although there were only 20 true variables out of 300, all the 300 variables were selected at least once in the 10,000

iterations, depending on condition combination. In the real data analysis as well, up to 275 predictors out of 309 were selected at least once in the 100 iterations; a single run of penalized regression is never recommended. Selection counts also relate to which penalized regression method to employ. Particularly when combined with either k-NN or EM, the 75% of selection counts of LASSO or Enet yielded perfect results in terms of variable selection. On the other hand, the 50% selection counts served as a reasonable criterion with Mnet, which coincides with the suggestion of Yoo and Rho (2020) that variables selected 50% or more of the time with Mnet and k-NN are worthy of further investigation. Although the prediction of Mnet was comparable to those of LASSO and Enet, Mnet detected the multiple-category variable at best 5% of the time. If identifying multiple-category variables is of research interest, LASSO or Enet is superior to Mnet. Furthermore, the difference between LASSO and Enet in the number of variables selected subsided after selection counts of 75% or more. This is good news in that there are many extensions of LASSO being developed, yet not much attention has been given to the relatively newer methods including Enet and Mnet.

Second, either k-NN or EM is recommended for penalized regression with missing data. LD showed the worst performance in terms of prediction measures and also was unable to select variables consistently; regardless of penalized regression methods and missingness mechanisms, there was no variable commonly selected in 75% or more iterations with LD. On the other hand, both k-NN and EM yielded comparable prediction results to those of complete data. Particularly, k-NN outperformed EM in imputation alone. Due to the inclusion of categorical predictors, the multivariate normality assumption of EM imputation did not hold, which partly explains the poorer performance of EM. When combined with penalized regression, however, k-NN and EM yielded similar prediction results. The possible interaction between penalized regression and missing data techniques deserves further attention. Besides, we generated Likert-scaled items from distributions of multivariate normality. The effect of Likert-scaled items from non-normal (i.e., skewed or kurtotic) distributions on penalized regression with imputation techniques appears another interesting research topic to investigate.

Third, missingness mechanisms (MAR and MNAR) had little effect on prediction measures. Out of 300 variables generated in this simulation study, 4 variables were set as the cause of missingness, and inclusion or exclusion of them in the model determined whether the data satisfied MAR (inclusion of the 4 variables) or MNAR (exclusion of them). To summarize, exclusion of approximately 1.3% (= 4/300) of the total variables did not influence results, although they were the cause of missingness. In other words, efforts to utilize large-scale data to the fullest appear to offer a valid approach to mitigate the effect of nonignorable missingness.

Finally, penalized regression which assumes linearity appears to be a viable method to analyze social science large-scale data. As most empirical studies analyzing social science large-scale data create linear models such as generalized linear modeling, penalized regression is considered a relevant extension. Linear models are easier to interpret than nonlinear models. Furthermore, Study II (empirical study) showed that RF was inferior or at best comparable to penalized regression in terms of prediction despite its fame as a nonlinear ML method of prediction capabilities. Although the distributional properties of Study II data are unknown, penalized regression consisting of only main-effects outperformed RF consisting of higher-order, interaction effects; linearity may be a valid assumption. When all the factors are taken together, penalized regression is preferred to nonlinear ML methods with regard to social science large-scale data analysis.

Limitations and future directions

In conclusion, penalized regression served as a promising ML method in analyzing large-scale social science data. As one of the first Monte Carlo simulation studies to investigate penalized regression with social science large-scale data in the presence of missing categorical predictors, this study has opened up multiple research topics. First, further studies are needed on statistical inference with penalized regression. While penalized regression produces biased estimates and thus obtaining confidence intervals or p-values is not feasible, employing post-selection inference (Lee et al., 2016; Tibshirani et al., 2016) or debiasing techniques (Javanmard & Montanari, 2014; Van de Geer et al., 2014) enables statistical inference. Although rapidly developing, these extensions are currently only available with LASSO.

Second, multistage sampling schemes (e.g., stratified cluster sampling) are common in large-scale survey data, and utilizing the complex membership information in analysis will be more realistic. We
MULTIVARIATE BEHAVIORAL RESEARCH 13
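For readers who wish to try such a pipeline, the sketch below illustrates one way to impute and then fit a penalized regression in R. It uses packages cited in the references (VIM for k-NN imputation, Amelia for EM-based imputation, glmnet for the elastic net) but is not the authors' code, and the data frame dat and the column name gender are hypothetical.

library(VIM)      # k-NN imputation with a Gower-type distance for mixed data (Kowarik & Templ, 2016)
library(Amelia)   # EM-based imputation (Honaker et al., 2011)
library(glmnet)

# dat: a data frame with the outcome y plus Likert-scaled and categorical predictors
dat_knn <- kNN(dat, k = 5, imp_var = FALSE)        # k-NN-completed data

am     <- amelia(dat, m = 1, noms = c("gender"))   # declare nominal variables; names are hypothetical
dat_em <- am$imputations[[1]]                      # EM-completed data

# Fit an elastic net on the completed data (alpha between 0 and 1 mixes LASSO and ridge)
X   <- model.matrix(y ~ ., data = dat_knn)[, -1]
fit <- cv.glmnet(X, dat_knn$y, alpha = 0.5)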
Third, missingness mechanisms (MAR and MNAR) had little effect on prediction measures. Out of 300 variables generated in this simulation study, 4 […] although they were the cause of missingness. In other words, efforts to utilize large-scale data to the fullest appear to offer a valid approach to mitigate the effect of nonignorable missingness.

Finally, penalized regression, which assumes linearity, appears to be a viable method to analyze social science large-scale data. As most empirical studies analyzing social science large-scale data build linear models such as generalized linear models, penalized regression is considered a relevant extension, and linear models are easier to interpret than nonlinear models. Furthermore, Study II (the empirical study) showed that RF was inferior or at best comparable to penalized regression in terms of prediction, despite its reputation as a nonlinear ML method with strong predictive capabilities. Although the distributional properties of the Study II data are unknown, penalized regression consisting of only main effects outperformed RF, which implicitly models higher-order, interaction effects; linearity may therefore be a valid assumption. When all of these factors are taken together, penalized regression is preferred to nonlinear ML methods with regard to social science large-scale data analysis.
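As a simple illustration of the kind of comparison reported for Study II, the following sketch contrasts an elastic net (main effects only) with a random forest on a held-out test set. It is not the Study II code; it assumes a data frame dat with a numeric outcome y.

library(glmnet)
library(randomForest)

set.seed(1)
idx   <- sample(nrow(dat), size = floor(0.7 * nrow(dat)))
train <- dat[idx, ]
test  <- dat[-idx, ]

Xtr <- model.matrix(y ~ ., data = train)[, -1]
Xte <- model.matrix(y ~ ., data = test)[, -1]

enet <- cv.glmnet(Xtr, train$y, alpha = 0.5)            # linear, main-effects model
rf   <- randomForest(y ~ ., data = train, ntree = 500)  # nonlinear model with implicit interactions

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(test$y, as.numeric(predict(enet, newx = Xte, s = "lambda.min")))
rmse(test$y, predict(rf, newdata = test))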
Limitations and future directions

In conclusion, penalized regression served as a promising ML method in analyzing large-scale social science data. As one of the first Monte Carlo simulation studies to investigate penalized regression with social science large-scale data in the presence of missing categorical predictors, this study has opened up multiple research topics. First, further studies are needed on statistical inference with penalized regression. Although penalized regression produces biased estimates, and obtaining confidence intervals or p-values is therefore not straightforward, employing post-selection inference (Lee et al., 2016; Tibshirani et al., 2016) or debiasing techniques (Javanmard & Montanari, 2014; Van de Geer et al., 2014) enables statistical inference. Although rapidly developing, these extensions are currently only available with LASSO.
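For the LASSO, such post-selection inference is implemented in the selectiveInference R package. The following is a minimal sketch, not a definitive recipe: it assumes a numeric predictor matrix x and outcome y, and the penalty value is fixed by the analyst for illustration rather than chosen by cross-validation.

library(glmnet)
library(selectiveInference)

n      <- nrow(x)
gfit   <- glmnet(x, y, standardize = FALSE)

lambda <- 0.8                                                  # illustrative fixed penalty value
beta   <- coef(gfit, s = lambda / n, exact = TRUE, x = x, y = y)[-1]

# Confidence intervals and p-values for the coefficients selected by the lasso
out <- fixedLassoInf(x, y, beta, lambda)
out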
Second, multistage sampling schemes (e.g., stratified cluster sampling) are common in large-scale survey data, and utilizing the complex membership information in analysis will be more realistic. We endeavored to emulate social science large-scale data by employing a 2-stage cluster sampling in data generation, and utilized the membership information in data deletion and imputation with EM. However, the literature on penalized regression incorporating sampling schemes is rather limited, and consequently there are few software packages that consider sampling schemes in penalized regression. For instance, grpreg, the software package used to execute group Mnet, does not offer an option to incorporate sampling weights, and thus the sampling schemes were not considered in the estimation phase. The effect of sampling weights in penalized regression needs to be investigated, particularly in simulation studies employing multistage sampling schemes.

Third, penalized regression methods for hierarchical data are another worthwhile topic to explore. Both the simulation and empirical studies (Studies I and II) did not have cluster-level predictors. However, many social science data sets have nested structures, such as clients nested within a center or participants measured multiple times, and including cluster-level predictors in modeling requires algorithms that handle the nested structure of the data. To our knowledge, glmmLasso by Groll and Tutz (2014) is currently one of the few penalized regression algorithms to serve this purpose and to be constantly maintained and updated. The algorithm glmmLasso is a combination of generalized linear mixed models (GLMM) and LASSO. However, as the name denotes, glmmLasso only deals with LASSO. As considering hundreds or thousands of variables of social science large-scale data in one model is likely to result in collinearity problems, other penalized regression algorithms that handle collinearity for nested data need to be developed, say glmmEnet or glmmMnet. Another issue with glmmLasso is its excruciatingly slow performance with large-scale data. Particularly when cluster-level predictors are added to the model, the speed of glmmLasso deteriorates, as the number of parameters to estimate expands and glmmLasso searches for an optimum step size for update and convergence in each iteration (Kim & Yoo, 2020). We strongly suggest that faster and more efficient penalized regression algorithms to deal with multicollinearity be developed for analyses of nested data.
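For reference, a minimal glmmLasso call looks like the sketch below. The outcome y, the predictors x1 to x3, and the cluster identifier school are hypothetical names; the cluster variable must be a factor, and in practice lambda would be tuned (e.g., by BIC over a grid) rather than fixed.

library(glmmLasso)

dat$school <- as.factor(dat$school)          # grouping factor for the random intercept

fit <- glmmLasso(fix = y ~ x1 + x2 + x3,     # penalized fixed effects
                 rnd = list(school = ~1),    # random intercept for the nested structure
                 data = dat,
                 lambda = 20)
summary(fit)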
Article information

Conflict of interest disclosures: Each author signed a form for disclosure of potential conflicts of interest. No authors reported any financial or other conflicts of interest in relation to the work described.

Ethical principles: The authors affirm having followed professional ethical guidelines in preparing this work. These guidelines include obtaining informed consent from human participants, maintaining ethical treatment and respect for the rights of human or animal participants, and ensuring the privacy of participants and their data, such as ensuring that individual participants cannot be identified in reported results or from publicly available original or archival data.

Funding: This work was not supported.

Role of the funders/sponsors: None of the funders or sponsors of this research had any role in the design and conduct of the study; collection, management, analysis, and interpretation of data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication.

Acknowledgments: The previous versions of this manuscript were presented at the 2019 American Educational Research Association (Toronto, Canada) and the 2020 National Council on Measurement in Education (San Francisco, USA).

ORCID

Jin Eun Yoo http://orcid.org/0000-0002-7082-5409
Minjeong Rho http://orcid.org/0000-0002-1781-5674

References

Allison, P. D. (2001). Missing data. Sage Publications.
Allison, P. D. (2003). Missing data techniques for structural equation modeling. Journal of Abnormal Psychology, 112(4), 545–557. https://doi.org/10.1037/0021-843X.112.4.545
Allison, P. D. (2014). Listwise deletion: It's NOT evil. http://statisticalhorizons.com/listwise-deletion-its-not-evil
Alsheikh, M. A., Lin, S., Niyato, D., & Tan, H. P. (2014). Machine learning in wireless sensor networks: Algorithms, strategies, and applications. IEEE Communications Surveys & Tutorials, 16(4), 1996–2018. https://doi.org/10.1109/COMST.2014.2320099
Barboza, F., Kimura, H., & Altman, E. (2017). Machine learning models and bankruptcy prediction. Expert Systems with Applications, 83, 405–417. https://doi.org/10.1016/j.eswa.2017.04.006
Beretta, L., & Santaniello, A. (2016). Nearest neighbor imputation algorithms: A critical evaluation. BMC Medical Informatics and Decision Making, 16(S3), 197–208. https://doi.org/10.1186/s12911-016-0318-z
Boiy, E., & Moens, M. F. (2009). A machine learning approach to sentiment analysis in multilingual Web texts. Information Retrieval, 12(5), 526–558. https://doi.org/10.1007/s10791-008-9070-z
Breheny, P., & Zeng, Y. (2020). Package 'grpreg'. https://cran.r-project.org/web/packages/grpreg/grpreg.pdf
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Breiman, L., Cutler, A., Liaw, A., & Wiener, M. (2018). Package 'randomForest'. https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
Buskirk, T. D., Bear, T., & Bareham, J. (2018, October 25–27). Machine made sampling designs: Applying machine learning methods for generating stratified sampling designs [Paper presentation]. Big Data Meets Survey Science Conference, Barcelona, Spain.
Bzdok, D., Altman, N., & Krzywinski, M. (2018). Statistics versus machine learning. Nature Methods, 15(4), 233–234. https://doi.org/10.1038/nmeth.4642
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research. Rand McNally.
Chen, Q., & Wang, S. (2013). Variable selection for multiply-imputed data with application to dioxin exposure study. Statistics in Medicine, 32(21), 3646–3659. https://doi.org/10.1002/sim.5783
Chen, W., Xie, X., Wang, J., Pradhan, B., Hong, H., Bui, D. T., Duan, Z., & Ma, J. (2017). A comparative study of logistic model tree, random forest, and classification and regression tree models for spatial prediction of landslide susceptibility. CATENA, 151, 147–160. https://doi.org/10.1016/j.catena.2016.11.032
Chew, R., Jones, K., Unangst, J., Cajka, J., Allpress, J., Amer, S., & Krotki, K. (2018). Toward model-generated household listing in low- and middle-income countries using deep learning. ISPRS International Journal of Geo-Information, 7(11), 448. https://doi.org/10.3390/ijgi7110448
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1–38. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x
Eck, A., Buskirk, T., Fletcher, K., Stefek, P., Shao, H., Park, K., & Losch, M. (2018, October 25–27). Machine made sampling frames: Creating sampling frames of windmills and other non-traditional sampling units using machine learning with neural networks [Paper presentation]. Big Data Meets Survey Science Conference, Barcelona, Spain.
Fan, J., Feng, Y., & Wu, Y. (2009). Network exploration via the adaptive LASSO and SCAD penalties. The Annals of Applied Statistics, 3(2), 521–541. https://doi.org/10.1214/08-AOAS215SUPP
Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360. https://doi.org/10.1198/016214501753382273
Finch, W. H., & Finch, M. E. H. (2016). Regularization methods for fitting linear models with small sample sizes: Fitting the Lasso estimator using R. Practical Assessment, Research & Evaluation, 21(7), 1–13. https://doi.org/10.7275/jr3d-cq04
Gao, Q., & Lee, T. C. (2017). High-dimensional variable selection in regression and classification with missing data. Signal Processing, 131, 1–7. https://doi.org/10.1016/j.sigpro.2016.07.014
Germeijs, V., & Verschueren, K. (2007). High school students' career decision-making process: Consequences for choice implementation in higher education. Journal of Vocational Behavior, 70(2), 223–241. https://doi.org/10.1016/j.jvb.2006.10.004
Gisev, N., Bell, J. S., & Chen, T. F. (2013). Interrater agreement and interrater reliability: Key concepts, approaches, and applications. Research in Social & Administrative Pharmacy, 9(3), 330–338. https://doi.org/10.1016/j.sapharm.2012.04.004
Goldfeld, K. (2020). Package 'simstudy'. https://CRAN.R-project.org/package=simstudy
Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27(4), 857–871. https://doi.org/10.2307/2528823
Groll, A., & Tutz, G. (2014). Variable selection for generalized linear mixed models by L1-penalized estimation. Statistics and Computing, 24(2), 137–154. https://doi.org/10.1007/s11222-012-9359-z
Hastie, T., & Qian, J. (2016). An introduction to glmnet. https://cloud.r-project.org/web/packages/glmnet/vignettes/glmnet.pdf
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer Science & Business Media.
Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical learning with sparsity: The lasso and generalizations. CRC Press.
Herent, P., Schmauch, B., Jehanno, P., Dehaene, O., Saillard, C., Balleyguier, C., Arfi-Rouche, J., & Jegou, S. (2019). Detection and characterization of MRI breast lesions using deep learning. Diagnostic and Interventional Imaging, 100(4), 219–225. https://doi.org/10.1016/j.diii.2019.02.008
Hill, C. A., Biemer, P., Buskirk, T., Callegaro, M., Cazar, A. L. C., Eck, A., Japec, L., Kirchner, A., Kolenikov, S., Lyberg, L., & Sturgis, P. (2019). Exploring new statistical frontiers at the intersection of survey science and big data: Convergence at "BigSurv18". Survey Research Methods, 13(1), 123–134. https://doi.org/10.18148/srm/2019.v1i1.7467
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67. https://doi.org/10.1080/00401706.1970.10488634
Honaker, J., King, G., & Blackwell, M. (2011). Amelia II: A program for missing data. Journal of Statistical Software, 45(7), 1–47. https://doi.org/10.18637/jss.v045.i07
Huang, J., Breheny, P., Lee, S., Ma, S., & Zhang, C. H. (2016). The Mnet method for variable selection. Statistica Sinica, 26(3), 903–923. https://doi.org/10.5705/ss.202014.0011
Huang, J., Keung, J. W., Sarro, F., Li, Y. F., Yu, Y. T., Chan, W. K., & Sun, H. (2017). Cross-validation based K nearest neighbor imputation for software quality datasets: An empirical study. Journal of Systems and Software, 132, 226–252. https://doi.org/10.1016/j.jss.2017.07.012
Huang, J., Ma, S., & Zhang, C. H. (2008). Adaptive Lasso for sparse high-dimensional regression models. Statistica Sinica, 18(4), 1603–1618.
Idris, N. M., Gnanasammandhan, M. K., Zhang, J., Ho, P. C., Mahendran, R., & Zhang, Y. (2012). In vivo photodynamic therapy using upconversion nanoparticles as remote-controlled nanotransducers. Nature Medicine, 18(10), 1580–1585. https://doi.org/10.1038/nm.2933
Inoue, T., Vinayavekhin, P., Wang, S., Wood, D., Munawar, A., Ko, B. J., Greco, N., & Tachibana, R. (2019, October 25–26). Domestic activities classification based on CNN using shuffling and mixing data augmentation [Paper presentation]. Detection and Classification of Acoustic Scenes and Events 2019, New York, NY. http://dcase.community/documents/workshop2019/proceedings/DCASE2019Workshop_Inoue_20.pdf
Javanmard, A., & Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1), 2869–2909. https://www.jmlr.org/papers/volume15/javanmard14a/javanmard14a.pdf
Johnson, B. A., Lin, D. Y., & Zeng, D. (2008). Penalized estimating functions and variable selection in semiparametric regression models. Journal of the American Statistical Association, 103(482), 672–680. https://doi.org/10.1198/016214508000000184
Kaur, P., Singh, M., & Josan, G. S. (2015). Classification and prediction based data mining algorithms to predict slow learners in education sector. Procedia Computer Science, 57, 500–508. https://doi.org/10.1016/j.procs.2015.07.372
Kim, H. H., & Swanson, N. R. (2014). Forecasting financial and macroeconomic variables using data reduction methods: New empirical evidence. Journal of Econometrics, 178, 352–367. https://doi.org/10.1016/j.jeconom.2013.08.033
Kim, H. G., & Yoo, J. E. (2020). ICILS 2018 variable exploration to predict computer and information literacy: Variable selection in multilevel modeling via glmmLasso. Journal of Education Science, 22(4), 1–21. https://doi.org/10.15564/jeju.2020.11.22.4.1
Kowarik, A., & Templ, M. (2016). Imputation with the R package VIM. Journal of Statistical Software, 74(7), 1–16. https://doi.org/10.18637/jss.v074.i07
Landis, J. R., & Koch, G. G. (1977). An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics, 33(2), 363–374. https://doi.org/10.2307/2529786
Lee, J. D., Sun, D. L., Sun, Y., & Taylor, J. E. (2016). Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3), 907–927. https://doi.org/10.1214/15-AOS1371
Liu, Y., Wang, Y., Feng, Y., & Wall, M. M. (2016). Variable selection and prediction with incomplete high-dimensional data. The Annals of Applied Statistics, 10(1), 418–450. https://doi.org/10.1214/15-AOAS899
Ma, S., & Huang, J. (2008). Penalized feature selection and classification in bioinformatics. Briefings in Bioinformatics, 9(5), 392–403. https://doi.org/10.1093/bib/bbn027
Magis, D., Tuerlinckx, F., & De Boeck, P. (2015). Detection of differential item functioning using the Lasso approach. Journal of Educational and Behavioral Statistics, 40(2), 111–135. https://doi.org/10.3102/1076998614559747
McNeish, D. M. (2015). Using Lasso for predictor selection and to assuage overfitting: A method long overlooked in behavioral sciences. Multivariate Behavioral Research, 50(5), 471–484. https://doi.org/10.1080/00273171.2015.1036965
Meinshausen, N., & Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics, 34(3), 1436–1462. https://doi.org/10.1214/009053606000000281
Meinshausen, N., & Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417–473. https://doi.org/10.1111/j.1467-9868.2010.00740.x
Mikovic, R., Arsic, B., Gligorijevic, D., Gacic, M., Petrovic, D., & Filipovic, N. (2019). The influence of social capital on knowledge management maturity of nonprofit organizations-predictive modelling based on a multilevel analysis. IEEE Access, 7, 47929–47943. https://doi.org/10.1109/ACCESS.2019.2909812
Myers, T. A. (2011). Goodbye, listwise deletion: Presenting hot deck imputation as an easy and effective tool for handling missing data. Communication Methods and Measures, 5(4), 297–310. https://doi.org/10.1080/19312458.2011.624490
Nguyen, N. H., & Tran, T. D. (2012). Robust lasso with missing and grossly corrupted observations. IEEE Transactions on Information Theory, 59(4), 2036–2058. https://doi.org/10.1109/TIT.2012.2232347
Peng, C. Y. J., Harwell, M., Liou, S. M., & Ehman, L. H. (2006). Advances in missing data methods and implications for educational research. In S. Sawilowsky (Ed.), Real data analysis (pp. 31–78). Information Age.
Pereira, J. M., Basto, M., & da Silva, A. F. (2016). The logistic lasso and ridge regression in predicting corporate failure. Procedia Economics and Finance, 39, 634–641. https://doi.org/10.1016/S2212-5671(16)30310-0
Polishchuk, P. G., Muratov, E. N., Artemenko, A. G., Kolumbin, O. G., Muratov, N. N., & Kuz'min, V. E. (2009). Application of random forest approach to QSAR prediction of aquatic toxicity. Journal of Chemical Information and Modeling, 49(11), 2481–2488. https://doi.org/10.1021/ci900203n
Rockel, T. (2020). Package 'missMethods'. https://cran.r-project.org/web/packages/missMethods/missMethods.pdf
Sabbe, N., Thas, O., & Ottoy, J. P. (2013). EMLasso: Logistic Lasso with missing data. Statistics in Medicine, 32(18), 3143–3157. https://doi.org/10.1002/sim.5760
Sahri, Z. B., & Yusof, R. B. (2014). Support vector machine-based fault diagnosis of power transformer using k nearest-neighbor imputed DGA dataset. Journal of Computer and Communications, 2(9), 22–31. https://doi.org/10.4236/jcc.2014.29004
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177. https://doi.org/10.1037/1082-989X.7.2.147
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Houghton Mifflin.
Shen, B., Liu, B. D., & Wang, Q. (2016). Elastic net regularized dictionary learning for image classification. Multimedia Tools and Applications, 75(15), 8861–8874. https://doi.org/10.1007/s11042-014-2257-y
Shevade, S. K., & Keerthi, S. S. (2003). A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics (Oxford, England), 19(17), 2246–2253. https://doi.org/10.1093/bioinformatics/btg308
Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289–310. https://doi.org/10.1214/10-STS330
Shweta, Bajpai, R., & Chaturvedi, H. (2015). Evaluation of inter-rater agreement and inter-rater reliability for observational data: An overview of concepts and methods. Journal of the Indian Academy of Applied Psychology, 41(3), 20–27.
Smeekes, S., & Wijler, E. (2018). Macroeconomic forecasting using penalized regression methods. International Journal of Forecasting, 34(3), 408–430. https://doi.org/10.1016/j.ijforecast.2018.01.001
Song, Q., Shepperd, M., Chen, X., & Liu, J. (2008). Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation. Journal of Systems and Software, 81(12), 2361–2370. https://doi.org/10.1016/j.jss.2008.05.008
Sun, J., Chen, Y., Liu, J., Ying, Z., & Xin, T. (2016). Latent variable selection for multidimensional item response theory models via L1 regularization. Psychometrika, 81(4), 921–939. https://doi.org/10.1007/s11336-016-9529-6
Sun, J., & Ye, Z. (2019). A LASSO-based method for detecting item-trait patterns of replenished items in multidimensional computerized adaptive testing. Frontiers in Psychology, 10, 1944. https://doi.org/10.3389/fpsyg.2019.01944
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
Tibshirani, R. J., Taylor, J., Lockhart, R., & Tibshirani, R. (2016). Exact post-selection inference for sequential regression procedures. Journal of the American Statistical Association, 111(514), 600–620. https://doi.org/10.1080/01621459.2015.1108848
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., & Altman, R. B. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics (Oxford, England), 17(6), 520–525. https://doi.org/10.1093/bioinformatics/17.6.520
Uematsu, Y., & Tanaka, S. (2019). High-dimensional macroeconomic forecasting and variable selection via penalized regression. The Econometrics Journal, 22(1), 34–56. https://doi.org/10.1111/ectj.12117
Van de Geer, S., Bühlmann, P., Ritov, Y. A., & Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3), 1166–1202. https://doi.org/10.1214/14-AOS1221
Waldmann, P., Mészáros, G., Gredler, B., Fuerst, C., & Sölkner, J. (2013). Evaluation of the lasso and the elastic net in genome-wide association studies. Frontiers in Genetics, 4, 270. https://doi.org/10.3389/fgene.2013.00270
Wang, X., Nan, B., Zhu, J., Koeppe, R., & Frey, K. (2017). Classification of ADNI PET images via regularized 3D functional data analysis. Biostatistics and Epidemiology, 1(1), 3–19. https://doi.org/10.1080/24709360.2017.1280213
Willging, P. A., & Johnson, S. D. (2009). Factors that influence students' decision to drop out of online courses. Journal of Asynchronous Learning Networks, 13(3), 115–127. https://files.eric.ed.gov/fulltext/EJ862360.pdf
Wu, T. T., Chen, Y. F., Hastie, T., Sobel, E., & Lange, K. (2009). Genome-wide association analysis by Lasso penalized logistic regression. Bioinformatics (Oxford, England), 25(6), 714–721. https://doi.org/10.1093/bioinformatics/btp041
Yoo, J. E. (2018). TIMSS 2011 student and teacher predictors for mathematics achievement explored and identified via elastic net. Frontiers in Psychology, 9, 317. https://doi.org/10.3389/fpsyg.2018.00317
Yoo, J. E., & Rho, M. (2020). Exploration of predictors for Korean teacher job satisfaction via a machine learning technique, Group Mnet. Frontiers in Psychology, 11, 441. https://doi.org/10.3389/fpsyg.2020.00441
Yu, Q., Miche, Y., Eirola, E., Van Heeswijk, M., Séverin, E., & Lendasse, A. (2013). Regularized extreme learning machine for regression with missing data. Neurocomputing, 102, 45–51. https://doi.org/10.1016/j.neucom.2012.02.040
Zahid, F. M., & Heumann, C. (2019). Multiple imputation with sequential penalized regression. Statistical Methods in Medical Research, 28(5), 1311–1327. https://doi.org/10.1177/0962280218755574
Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2), 894–942. https://doi.org/10.1214/09-AOS729
Zhang, C., & Ma, Y. (Eds.). (2012). Ensemble machine learning: Methods and applications. Springer Science & Business Media.
Zhang, K., Yin, F., & Xiong, S. (2014). Comparisons of penalized least squares methods by simulations. arXiv preprint arXiv:1405.1796.
Zhao, P., & Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine Learning Research, 7, 2541–2563.
Zhou, Q., Chen, W., Song, S., Gardner, J. R., Weinberger, K. Q., & Chen, Y. (2015, February 25–30). A reduction of the elastic net to support vector machines with an application to GPU computing [Paper presentation]. Twenty-Ninth AAAI Conference on Artificial Intelligence (pp. 3210–3216), Austin, TX.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429. https://doi.org/10.1198/016214506000000735
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x

Appendix: R codes

The R codes of this study can be found at https://github.com/MJRoh/penRegmiss.