
This article was downloaded by: [University of Windsor]

On: 19 November 2014, At: 11:19


Publisher: Taylor & Francis

Econometric Reviews
Publication details, including instructions for authors and subscription information:
http://www.tandfonline.com/loi/lecr20

Selection of regressors in econometrics: parametric and nonparametric methods

Pascal Lavergne
INRA-ESR, BP 27, 31326 Castanet-Tolosan Cedex, France
Published online: 21 Mar 2007.

To cite this article: Pascal Lavergne (1998) Selection of regressors in econometrics: parametric and nonparametric methods
selection of regressors in econometrics, Econometric Reviews, 17:3, 227-273, DOI: 10.1080/07474939808800415

To link to this article: http://dx.doi.org/10.1080/07474939808800415

ECONOMETRIC REVIEWS, 17(3), 227-273 (1998)

SELECTION OF REGRESSORS
IN ECONOMETRICS: PARAMETRIC
AND NONPARAMETRIC METHODS

Pascal Lavergne
INRA-ESR, BP 27
31326 CASTANET-TOLOSAN Cedex FRANCE
Email: [email protected]

Keywords: Selection of regressors, Discrimination.

JEL classification: Primary C52; Secondary C20.

ABSTRACT

The present paper addresses the selection-of-regressors issue within a general discrimination framework. We show how this framework is useful in unifying various
procedures for selecting regressors and helpful in understanding the different strate-
gies underlying these procedures. We review selection of regressors in linear, non-
linear and nonparametric regression models. In each case we successively consider
model selection criteria and hypothesis testing procedures.

1 Introduction
This paper presents a general framework for model discrimination and its use for
selecting explanatory variables in regression models. The focus on the selection-of-
regressors issue is motivated by several considerations. From a pedagogical viewpoint,
the statistical procedures are more easily presented in the setting of regressor choice.

Copyright © 1998 by Marcel Dekker, Inc.



This allows us to focus on the main features of the approach. But, more importantly,
selecting regressors is one of the foremost issues in empirical modelling and arises in
the simplest linear model as well as in general nonparametric regression models. The
matter is even more acute in economics than in experimental sciences, as economic
theory often suggests a large collection of potential explanatory variables among
which one would like to discriminate. This motivates the large and long-standing
econometric literature focusing on this problem.
It must be first recognized that the selection-of-regressors issue can occur from
qualitatively different purposes. When an econometrician has data on a number of
potential explanatory variables, his goals may be as various as (i) the knowledge of
exactly which variables are relevant in explaining the dependent variable, (ii) the
simplification of a general complex model, (iii) the precise estimation of some pa-
rameters of interest, (iv) the predictive ability of the model or simply (v) its ease

of interpretation. More importantly, a number of economic hypotheses correspond


to particular restrictions on the initial set of regressors. In production economet-
rics, for instance, the constant returns-to-scale hypothesis is tested by assessing the
relevance of the output level in the unit cost function. Similarly, tests for equality
of coefficients across populations or periods of time can be viewed as discriminating
between competing sets of regressors. Furthermore, it often occurs in econometrics
that some variables may not be uniquely defined, for instance income or money supply.
Others are not observable and the use of different proxies may be considered. In
these cases, the analyst faces different sensible sets of regressors from which he wants
to choose the most appropriate one for further analysis. Lastly, an econometrician
may be confronted with competing economic theories that lead to regression models
with different explanatory variables. For instance, Keynesian and monetarist theories
lead to nonnested models for explaining disposable income or the unemployment rate;
see e.g. Friedman and Meiselman (1963), McAleer and McKenzie (1989).
In view of this diversity in modelling situations, flexibility must be an essential
feature of any model selection approach. The discrimination framework we adopt
covers a wide range of problems: in their monograph, Linhart and Zucchini (1986)
devote eleven chapters to various situations, from density estimation to ARIMA
models. Actually, it is difficult to think of a selection problem that cannot be accommodated
in this framework. Another major interest of this approach lies in its
generality. Indeed the discrimination framework unifies model choice rules based on
popular criteria and model selection tests. As we will show, it sheds light on the
most well-known selection procedures from the econometrician's toolbox. Moreover, it

may help in suggesting new ones. The third notable quality of the framework is that
it leads to a better understanding of the particular strategy underlying any practi-
cal selection procedure and clarifies the implicit assumptions needed for obtaining
desirable properties.
More crucially, the discrimination approach can take into account misspecifi-
cation of the models under consideration. This is a key feature for application in
in econometrics. Indeed, it is now widely recognized that an economic model (...) is
in fact only a more or less crude approximation to whatever might be the "true" rela-
tionships among the observed data, rather than providing an accurate description of
either the actual economic or probabilistic relationships (White, 1994). In addition,
the framework easily extends to models that depend on some infinite dimensional
parameters. Therefore it can be adapted to selection among semiparametric and
nonparametric models that are becoming popular in econometrics.

The selection of regressors issue is a perfect instance where the variety and
the richness of the discrimination approach can be exemplified. With the aim of giving
an extensive picture, we present some very popular selection procedures and others
that illustrate unusual applications of the approach. It is, however, clearly impossible
to be exhaustive. In particular, we intentionally adopt a frequentist viewpoint and
omit procedures based on Bayesian arguments, such as Schwarz's (1978) information
criterion or Jeffreys' posterior odds, see Zellner (1971). However, the discrimination
framework is really a decision-theoretic one, and does not seem to conflict with a
Bayesian viewpoint. We also confine ourselves to i.i.d. contexts and do not cope
with time series (only on a few occasions is reference made to order determination
in autoregressive models). Finally, we mainly focus on the asymptotic properties of
the procedures and do not discuss their small sample properties or situations where
the number of candidate regressors is large with respect to the sample size. We hope
however that this review will answer some of the common questions of practitioners
and will be a useful guide for further reading.
Our review is organized as follows. In Section 2, we present a general framework
for discrimination by means of a discrepancy. The following three sections are re-
spectively devoted to linear, nonlinear and nonparametric regression models. In each
section we consider successively model selection criteria and hypothesis testing pro-
cedures, and for the latter, we distinguish between nested and nonnested situations.
For general properties of parametric estimation methods and testing procedures, as
well as regularity conditions for their validity, the reader can refer to Newey and
McFadden (1994). We focus on two particular discrepancies, the Gauss discrepancy

and the Kullback-Leibler discrepancy, because most of the literature on regression


models' selection uses one of these two discrepancy measures as a basis for discrim-
ination. This is not to say that they are the only usable ones, see Section 3.2.1 for
an instance of a procedure based on a different discrepancy. In Section 3, we detail
how various popular criteria and testing procedures for linear normal models can be
derived from the choice of the Gauss discrepancy. In Section 4, we deal with general
nonlinear parametric models; we first consider the Gauss discrepancy and discuss
cross-validation and bootstrap methods for deriving selection criteria; we then con-
sider the Kullback-Leibler discrepancy and derive the main selection procedures; we
also contrast the discrimination approach with specification analysis, in particular
with respect to the comparison of nonnested models - the nonnested situation is
indeed one where the specificities of the two approaches are most apparent. In Section 5,
we consider the choice of regressors in nonparametric models and we present

selection procedures based on the Gauss discrepancy. Because this literature is still
in its infancy, much cited work is still unpublished and the related bibliography is
of a more tentative nature. The Conclusion is devoted to more general comments.

2 Discrimination by means of a discrepancy


Models The first step in the discrimination framework is to distinguish between
different concepts of models that are used in the modelling process.¹ At the outset,
the practitioner has some knowledge about the subject under investigation that
may imply some specific information on the observations, such as the non-negativity
of some variables, the independence of some events... However, it is only in exceptional
cases that sufficient information is available to fully specify the unknown data
generating process (hereafter DGP) which governs the observations. One can only
circumscribe a family of probability distributions including this DGP, and generally
this family cannot be appropriately used to fit the data, because either it is too
complex with respect to the available data or one does not have enough a priori
information.
Thus one has to define an approximating family of probability distributions
that will be used to fit the data, and that defines the approximating model. In a
parametric setup, the approximating family is denoted by $G_\Theta$, of generic element

¹Our presentation of the discrimination approach is largely inspired by Linhart and Zucchini
(1986).

$G_\theta$, $\theta$ being a parameter in a subset $\Theta$ of $R^p$. An approximating model is labelled
as correctly specified if it contains the unknown DGP.
The estimated model, which is the basis of inferences about the economic phenomenon,
is an element of $G_\Theta$ denoted $G_{\hat\theta}$. The value $\hat\theta$ generally depends on the
sample through an estimation process. For the sake of simplicity, we denote in the
same way this value and the estimator as a function from the sample to $\Theta$.

Discrepancy The second step of the approach is to choose a discrepancy function


which specifies in what sense models are close to each other. A discrepancy function
is naturally defined over a set of probability distributions.² To each pair of probability
distributions, it associates a real number, which measures the dissimilarity
between these distributions. Specifically, let $M$ be a set of probability distributions;
a discrepancy function is a mapping
$$\Delta : M \times M \to R.$$

Without loss of generality, the discrepancy function may be chosen as nonnegative
and such that $\Delta(F, F) = 0$ for all $F \in M$. However, the discrepancy function is generally
not a distance and need not even be symmetric in its arguments. Indeed, it is particularly
valuable from a selection viewpoint that the discrepancy gives a directional
measure of the dissimilarity of a model with respect to a benchmark. In the discrimination
approach, each (approximating or estimated) model is compared to the
DGP. Therefore, the lack of fit of a particular (approximating or estimated) model
is measured by its discrepancy with the benchmark that the DGP constitutes, and
competing models are subsequently compared with respect to these values. Specifically,
the discrepancy measure due to approximation is defined as the minimum
discrepancy between an element of the approximating family and the DGP $F_0$, that
is $\Delta(\theta^*) = \Delta(G_{\theta^*}, F_0)$, where $\theta^* = \arg\min \{\Delta(G_\theta, F_0);\ \theta \in \Theta\}$. The distribution
$G_{\theta^*}$ corresponds to the best approximating distribution for the family of distributions
$G_\Theta$ and the discrepancy function $\Delta$. The discrepancy due to approximation
can be interpreted as a loss incurred from the approximation of the unknown DGP.
It typically decreases as the complexity of the approximating family grows (with the
dimension of $\theta$). Similarly, the discrepancy measure due to estimation is $\Delta(G_{\hat\theta}, G_{\theta^*})$.
It is the loss due to our ignorance of $\theta^*$ and to data limitation, and it increases with

²Depending on the context, the discrepancy can be defined over unconditional or conditional (to
some explanatory variables) distributions.

the complexity of the approximating family. Lastly, the overall discrepancy measure
is $\Delta(\hat\theta) = \Delta(G_{\hat\theta}, F_0)$ and gauges the loss incurred from both approximation
and estimation. In many cases, the overall discrepancy is simply the sum of the
two previous measures, but even if it is not, when the complexity of $G_\Theta$ increases,
the two effects, improvement of approximation and deterioration in estimation,
generally act in opposition to each other.
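To fix ideas, this decomposition can be illustrated numerically (the sketch below is ours, not the paper's): take the Gauss discrepancy, a DGP with regression function $\exp(x)$, and a linear approximating family; a fine grid stands in for the distribution of the regressor, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

# DGP: y = exp(x) + eps; approximating family: regression functions a + b*x.
# A fine grid stands in for the distribution of the regressor x on [0, 1].
xg = np.linspace(0.0, 1.0, 1001)
m0 = np.exp(xg)                                  # true regression function
Xg = np.column_stack([np.ones_like(xg), xg])

# Best approximating parameter theta*: population least squares over the grid
theta_star = np.linalg.lstsq(Xg, m0, rcond=None)[0]
approx = np.mean((m0 - Xg @ theta_star) ** 2)    # discrepancy due to approximation

# theta_hat from a finite sample drawn from the DGP
n = 50
xs = rng.uniform(0.0, 1.0, size=n)
ys = np.exp(xs) + 0.3 * rng.normal(size=n)
Xs = np.column_stack([np.ones_like(xs), xs])
theta_hat = np.linalg.lstsq(Xs, ys, rcond=None)[0]

estim = np.mean((Xg @ (theta_hat - theta_star)) ** 2)   # discrepancy due to estimation
overall = np.mean((m0 - Xg @ theta_hat) ** 2)           # overall discrepancy

# For the Gauss discrepancy and least squares, the overall discrepancy decomposes
# exactly (up to the grid approximation) into the two components
assert abs(overall - (approx + estim)) < 1e-8
```

Here the exact additivity is a special feature of the Gauss discrepancy (an orthogonality argument); as noted above, it does not hold for every discrepancy function.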

Strategy The choice of the discrepancy function determines which aspect of the
model is important for the practitioner.³ The particular measure that serves as a
basis of comparison is driven by the goal of the selection process. One possibility is
to compare competing approximating models through their associated discrepancy
measure due to approximation $\Delta(\theta^*)$. This arises in applications where we want
to select the approximating family which potentially gives the best fit, irrespective

of whether this can be achieved with a given sample, e.g. when we confront two
competing economic theories or in pilot studies to larger experiments. Another
possibility is to compare competing fitting procedures, which consist of pairs of an
approximating family and an estimation method. Because the overall discrepancy
measure $\Delta(\hat\theta)$ heavily depends on the particular sample at hand, it may be too hazardous
to use it. As we usually want to favour estimated models which, on average,
result in low overall discrepancies, the selection can be based on the expected overall
discrepancy $E[\Delta(\hat\theta)]$.⁴ If the same estimation method is used in each fitting procedure,
we will effectively select among competing approximating families. Because of
the two competing effects previously mentioned, it can occur that in finite samples
the expected overall discrepancy is minimum for an approximating family that is
not correctly specified, even when such a model is among the considered ones. On
the other hand, if the fitting procedures differ only through the estimation method,
we will select among competing estimators.⁵

Implementation Analytical derivation of the discrepancy measure can lead to


insurmountable difficulties in some situations, and even if we obtain such an expres-

³There may be several aspects of importance. Dealing with a multidimensional discrepancy
would be possible. However, this can result in an incomplete ordering of the models.
⁴The symbol $E[\cdot]$ denotes expectation with respect to any random event included inside the
brackets. The expectation is conditional if the discrepancy is.
⁵It is another nice feature of this framework that it includes the choice of the estimation method
as well. Conversely, Eubank (1988, Chapter 2) includes variable selection in a general framework
primarily designed for choosing between estimators.

sion, it will still depend on unknown parameters from the DGP and will have to
be estimated. For implementation of the procedure, we thus need an estimator of
the chosen discrepancy measure, which we call a criterion. To define this criterion,
we may rely on finite sample properties or resort to either asymptotic arguments
or resampling schemes. Though each problem requires its own method of analysis,
a usual device is to use the empirical distribution $F_n$ in place of the unknown $F_0$, so
that $\Delta(G_\theta, F_0)$ is estimated by $\Delta(G_\theta, F_n)$. Similarly, a natural related choice for the
estimator $\hat\theta$ is the so-called minimum empirical discrepancy estimator
$$\hat\theta = \arg\min \{\Delta(G_\theta, F_n);\ \theta \in \Theta\}.$$
However, it may happen that the computed estimator is not the minimum empirical
discrepancy estimator corresponding to the criterion used for judging lack of fit.
This arises for instance when one uses one part of the sample for estimation and

another for comparing competing models.
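As a concrete illustration (ours, not from the paper), with the Gauss discrepancy and a linear approximating family, the minimum empirical discrepancy estimator is the ordinary least-squares estimator; the simulated data and names below are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

def empirical_discrepancy(theta):
    # Delta(G_theta, F_n): the Gauss discrepancy evaluated at the
    # empirical distribution F_n, i.e. the mean squared residual
    return np.mean((y - X @ theta) ** 2)

# The minimum empirical discrepancy estimator solves the normal equations,
# i.e. it is the ordinary least-squares estimator
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Any perturbation of theta_hat increases the empirical discrepancy
for delta in rng.normal(size=(5, 2)):
    assert empirical_discrepancy(theta_hat) <= empirical_discrepancy(theta_hat + 0.1 * delta)
```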

Criteria-based rules: parsimony and asymptotic optimality The selection
among competing models can be done by comparing the criteria corresponding to
each approximating family and selecting the model with the smallest criterion. If
criteria are consistent, this should result in choosing a model with minimum theoretical
discrepancy measure with probability approaching one asymptotically. However,
it is likely that many models lead to this minimum. For instance, this is true for
any approximating model that includes the DGP. Therefore parsimony should be
introduced so as to favor models of smaller dimension.⁶ We label as "optimal" a
model with minimum discrepancy and minimum dimension. Looking for an optimal
model usually greatly reduces the number of potential candidates, though it may
not be unique. Asymptotic optimality (or consistency) of the selection rule is
reached if the procedure selects an optimal model with probability approaching one
asymptotically. Now consistency of the criterion is generally not sufficient to obtain
consistency of the procedure, so that this asymptotic optimality must be studied on
its own.

Discrimination tests Alternatively, if the (finite-sample or asymptotic) distribution
of the criteria can be derived, it is possible to take into account the sample

⁶In a parametric setup, the dimension of an approximating model $G_\Theta$ is usually that of $\Theta$.
For selection of regressors, this is the dimension of the regressors' set.

variability of the criteria. For comparing two approximating models $G_{\Theta_1}$ and $G_{\Theta_2}$,
the null hypothesis tested is
$$H_0 : \Delta(\theta_1^*) = \Delta(\theta_2^*)$$
against
$$H_1 : \Delta(\theta_1^*) < \Delta(\theta_2^*) \quad \text{and} \quad H_2 : \Delta(\theta_1^*) > \Delta(\theta_2^*).$$

Rejection of $H_0$ in favor of $H_1$ or $H_2$ indicates which model dominates the other
according to the discrepancy measure. An important feature of this framework is
that the two competing models are treated symmetrically and that the resulting
tests are directional. Also this framework is quite general, as it does not make an

explicit distinction between nested and nonnested situations. Another approach is
to take $H_0 \cup H_1$ (say) as the null hypothesis and $H_2$ as the alternative. This would
introduce directly an asymmetry between the two competing models, similar to the
one found in the standard Neyman-Pearson theory of testing in the nested case.
As a matter of fact, this naturally occurs in a nested situation when $G_{\Theta_1} \subset G_{\Theta_2}$,
as $H_1$ cannot occur by definition of $\Delta(\theta_2^*)$. Were the null hypothesis not rejected,
one typically retains the smallest model. That is, when the two competing models
are equivalent with respect to the discrepancy measure, one invokes parsimony so
as to discriminate between the competing models. Similar hypotheses based on the
expected overall discrepancy can equally be considered. However, when comparing
nested models, the above remark does not hold, and in general there is no natural
asymmetry between the models.

Discrimination versus specification tests As MacKinnon (1983) points out,


while it is customary in the econometric literature to treat model selection tests and
non-nested hypothesis tests as being very closely related, and even as rival procedures,
that is a very misleading point of view, as they are not tailored for the same
objectives. In specification analysis, the usual null hypothesis of interest is the
correct specification of one of the models, and the models under consideration are
treated in an asymmetric way. In contrast, in the discrimination approach, the null
hypothesis is that both models approximate equally well the unknown DGP, and
the two models have a symmetric role. Moreover, discrimination tests are generally

speaking based on the properties under the DGP and do not need to assume the
correctness of any of the approximating models.⁷
The two types of tests can be equivalent only if the implicit null and alternative
hypotheses are identical. This is the case when comparing two nested models $G_{\Theta_1}$
and $G_{\Theta_2}$ with respect to their discrepancy due to approximation and assuming the
correctness of the general model $G_{\Theta_2}$. While the null hypothesis of the specification
test is the correctness of $G_{\Theta_1}$, the null hypothesis of the discrimination test also
reduces here to the correctness of the restricted model. Indeed, the correctness of
$G_{\Theta_2}$ means that the DGP belongs to this family, i.e. $\Delta(\theta_2^*) = 0$, so that the null
hypothesis $\Delta(\theta_1^*) = \Delta(\theta_2^*)$ is equivalent to $\Delta(\theta_1^*) = 0$, i.e. the correctness of $G_{\Theta_1}$. It
will be seen in our review that most popular tests for nested hypotheses are both
specification and discrimination tests. This special equivalence property, holding in
a well-known but particular situation, is certainly one of the main reasons for the

above-mentioned confusion. However, there is no general equivalence between the
two kinds of tests, which typically differ in both the considered (null and alternative)
hypotheses and the maintained assumptions.

3 Selection of regressors in linear models


Let $y$ be an $(n \times 1)$ vector of observations and $X$ an $(n \times k)$ matrix of rank $k$
containing fixed regressors, including the constant term. We consider the family of
probability distributions
Model 0: $y = X\beta + u$,
where $u$ denotes an $(n \times 1)$ vector of i.i.d. normal errors. The DGP is
then characterized by unknown values $\beta_0$ and $\sigma_0^2$ of the parameters. Consider now
a partition of $X$ into $X_p$ and $X_r$, of dimensions $p$ and $r$ $(= k - p)$ respectively. The
considered approximating models will be of the general form
Model p: $y = X_p\beta_p + u_p$.
The dimension $p$ may vary across approximating models, and there may also
exist many approximating models with $p$ regressors out of $k$, but we will use this
generic form and call such a model a $p$-model.⁸ Note that the complete $k$-model
itself is often one of the considered competing approximating models.

⁷If such an assumption is made, the strategy can nevertheless be used with qualifications.
⁸One can also consider more general formulations embodying linear combinations of variables.

The discrepancy usually considered in this setting is the Gauss discrepancy
$$\Delta_G(G_\theta, F_0) = E_{F_0}[y_0 - E_{G_\theta}(y_0|X)]^2.$$
Here $y_0$ is an $(n \times 1)$ vector generated as, but independently of, $y$; $E_{F_0}$ and $E_{G_\theta}$
denote expectations with respect to the DGP and with respect to an element of the
approximating family respectively, and the notation $v^2$ means $v'v$.⁹ In the case of linear
approximating models, the discrepancy due to approximation is
$$E_{F_0}[y_0 - X_p\beta_p^*]^2 = \min \{E_{F_0}[y_0 - X_p\beta_p]^2;\ \beta_p \in R^p\}.$$
The expected overall discrepancy is
$$E\, E_{F_0}[y_0 - X_p\hat\beta_p]^2.$$
It is easily seen that $X_p\beta_p^* = P_pX\beta_0$, where $P_p = X_p(X_p'X_p)^{-1}X_p'$ is the projection
matrix on the subspace spanned by $X_p$. The minimum empirical Gauss discrepancy

estimator of $\beta_p^*$ is here the ordinary least-squares estimator $\hat\beta_p = (X_p'X_p)^{-1}X_p'y$,
whose finite-sample properties are well known. Note that there is no minimum
empirical discrepancy estimator for the residual variance $\sigma_p^2$ in the $p$-model, because
it does not play any role in the conditional expectation of $y$ given $X_p$.

3.1 Criteria in linear models


Numerous procedures for selection of regressors are based on the predictive ability
of a model, measured through the in-sample mean squared error of prediction
$$MSEP = \frac{1}{n}\, E\, E_{F_0}[y_0 - X_p\hat\beta_p]^2 = \sigma_0^2\left(1 + \frac{p}{n}\right) + \frac{1}{n}\beta_0'X'M_pX\beta_0,$$
where $M_p = I_n - P_p$ is the projection matrix on the subspace orthogonal to the
subspace spanned by $X_p$ (see the Appendix for the derivation). The $MSEP$ is the
expected overall discrepancy measure, normalized by the sample size. It splits into
a variance term (the variance of the prediction) and a squared bias term. Choosing
the model that minimizes the $MSEP$ does not necessarily lead to retaining all
pertinent variables (those corresponding to a non-zero coefficient in the DGP). It may
be the case that in finite samples the $MSEP$ is minimized for an incomplete model
if the reduction in the prediction variance outweighs the squared bias term due to
the omission of some relevant variables.
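A small simulation (ours, not the paper's) makes this trade-off concrete. Using the standard closed form $MSEP_p = \sigma_0^2(1 + p/n) + (1/n)\beta_0'X'M_pX\beta_0$, a model omitting a regressor with a small nonzero coefficient can have a smaller $MSEP$ than the complete model; the design, coefficients and names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 30, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta0 = np.array([1.0, 1.0, 0.05])   # the third regressor is nearly irrelevant

def msep(cols):
    # MSEP of the model using the regressor columns in `cols`:
    # variance term sigma2*(1 + p/n) plus squared bias (1/n) beta0'X'M_p X beta0
    Xp = X[:, cols]
    p = Xp.shape[1]
    Mp = np.eye(n) - Xp @ np.linalg.solve(Xp.T @ Xp, Xp.T)
    bias2 = (X @ beta0) @ Mp @ (X @ beta0) / n
    return sigma2 * (1 + p / n) + bias2

full = msep([0, 1, 2])
restricted = msep([0, 1])    # drops the weak regressor
assert restricted < full     # the variance saving outweighs the squared bias
```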

The Prediction Criterion Let us assume for a moment that the restricted $p$-model
is correct. Then the bias term vanishes, so that an unbiased estimator of the

⁹The vector $y_0$ is introduced to formally distinguish between the discrepancy, defined independently
from the observations, and the criteria, which rely on them.

$MSEP$ is
$$PC = \frac{n+p}{n-p}\,\frac{RSS_p}{n},$$
where $RSS_p$ is the residual sum of squares for the estimated $p$-model. The Prediction
Criterion $PC$ is found in Rothman (1968), Hocking (1972) and Amemiya (1980).
These authors propose to compare the value of $PC$ across the competing $p$-models.
It is clear from what precedes that one assumes correctness of the $p$-model under
consideration to derive the criterion. Therefore, the selection procedure comes to
consider each model as correct. This strategy has been labelled as optimistic by
Amemiya (1980). It is indeed really optimistic to assume that every considered
model is correct, but we will see that other rules are in the same vein.
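As a sketch (ours, not from the paper), $PC$ can be computed for each candidate subset and minimized; the simulated data, names and candidate list are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X[:, :3] @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)  # 4th column irrelevant

def rss(cols):
    # residual sum of squares of the model using the columns in `cols`
    Xp = X[:, list(cols)]
    bhat = np.linalg.solve(Xp.T @ Xp, Xp.T @ y)
    e = y - Xp @ bhat
    return e @ e

def prediction_criterion(cols):
    # PC = (n + p)/(n - p) * RSS_p / n
    p = len(cols)
    return (n + p) / (n - p) * rss(cols) / n

candidates = [[0, 1], [0, 1, 2], [0, 1, 2, 3]]
# Subsets missing a relevant regressor carry a large RSS and are heavily penalized
best = min(candidates, key=prediction_criterion)
```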

Mallows' $C_p$ Assuming that the DGP belongs to Model 0, we can derive an unbiased
estimator of the $MSEP$. Indeed, we have
$$E[RSS_p] = (n-p)\sigma_0^2 + \beta_0'X'M_pX\beta_0.$$
We then obtain the criterion
$$\frac{RSS_p}{n} + \frac{2p}{n}\,\frac{RSS_k}{n-k}.$$
Mallows' criterion (1973) is a monotonic transformation of this quantity, defined as
$$C_p = \frac{RSS_p}{\hat\sigma^2} - n + 2p,$$
with $\hat\sigma^2 = RSS_k/(n-k)$. This criterion is then coherently derived from a strategy
based on the $MSEP$ and assuming that the DGP comes from Model 0.
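A brief numerical sketch of the criterion (ours; all names are illustrative). A standard sanity check is that for the complete $k$-model, $C_p$ equals $k$ by construction.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X[:, :3] @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

def rss(cols):
    Xp = X[:, list(cols)]
    bhat = np.linalg.solve(Xp.T @ Xp, Xp.T @ y)
    e = y - Xp @ bhat
    return e @ e

sigma2_hat = rss(range(k)) / (n - k)   # residual variance from the complete k-model

def mallows_cp(cols):
    # Cp = RSS_p / sigma2_hat - n + 2p
    return rss(cols) / sigma2_hat - n + 2 * len(cols)

# Sanity check: for the complete model, Cp equals k by construction
assert abs(mallows_cp(range(k)) - k) < 1e-8
```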

Hocking's $S_p$ A more specific criterion is found if we assume that $(X, y)$ has a
joint normal distribution. In this case, we can consider the expectation $E_X[MSEP]$
(where $E_X$ denotes the expectation with respect to $X$). An unbiased estimator of
this quantity is
$$(n-1)S_p = \frac{n-1}{(n-p)(n-p-1)}\,RSS_p.$$
See Thompson (1978) for a proof. The criterion $S_p$ is usually credited to Hocking
(1976), himself referring to Tukey (1967). It may be found surprising that the criterion
does not depend upon the $RSS$ for Model 0. This is because in this particular
Gaussian setting, any restricted model is also correct, i.e. any conditional expectation
of $y$ given $X_p$ is linear in $X_p$. Not surprisingly, it is found that the use of $PC$

and $C_p$ is equivalent to the use of $S_p$ under the normality assumption; see Kinal and
Lahiri (1984).

The $\bar R^2$ rule Let us consider now the discrepancy due to approximation. For a
$p$-model, it is just the residual variance, namely
$$\sigma_p^{*2} = \frac{1}{n}E_{F_0}[y_0 - X_p\beta_p^*]^2 = \sigma_0^2 + (1/n)\beta_0'X'M_pX\beta_0.$$
The discrepancy measure is then the sum of the residual variance of the DGP and
of the squared part of the regression that is not accounted for by the $p$-model. This
measure penalizes the dimension of the model less than the $MSEP$ in small samples.
If the considered $p$-model were correct, an unbiased estimator of the residual
variance $\sigma_p^{*2}$ in the $p$-model would be $\tilde\sigma_p^2 = RSS_p/(n-p)$. Minimizing $\tilde\sigma_p^2$ across
models is actually equivalent to Theil's (1957) maximum $\bar R^2$ rule. The strategy

employed here is optimistic, as is the $PC$ rule, for one assumes the correctness
of the $p$-model to derive the criterion, and subsequently compares criteria among
models.¹⁰ When comparing the optimal subset with a larger one, the probability of
picking the optimal model is considerably less than unity, because both criteria
have the same expectation; see Ebbeler (1975). This is by no means surprising. From
our framework, it is clear that minimizing the residual variance does not ensure that
irrelevant variables will be eliminated, but only that pertinent ones will not be
removed.
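Since $\bar R^2 = 1 - \tilde\sigma_p^2/(TSS/(n-1))$ is a decreasing transformation of $\tilde\sigma_p^2$, the two rules always pick the same model. A quick sketch of this equivalence (ours; names and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X[:, :3] @ np.array([1.0, 1.5, -1.0]) + rng.normal(size=n)
tss = np.sum((y - y.mean()) ** 2)

def rss(cols):
    Xp = X[:, list(cols)]
    bhat = np.linalg.lstsq(Xp, y, rcond=None)[0]
    e = y - Xp @ bhat
    return e @ e

def sigma_tilde2(cols):
    # unbiased residual-variance estimator in the p-model (if that model is correct)
    return rss(cols) / (n - len(cols))

def r2_bar(cols):
    # adjusted R-squared: a decreasing transformation of sigma_tilde2
    return 1.0 - sigma_tilde2(cols) / (tss / (n - 1))

candidates = [[0, 1], [0, 1, 2], [0, 1, 2, 3]]
best_sigma = min(candidates, key=sigma_tilde2)
best_r2bar = max(candidates, key=r2_bar)
assert best_sigma == best_r2bar   # the two rules always agree
```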

3.2 Discrimination tests for linear models


3.2.1 Nested models

The Fisher test Let us focus on the choice between Model 0 and a restricted
model with $p$ regressors, and compare their discrepancies due to approximation.¹¹
The equality
$$E[y_0 - X_p\beta_p^*]^2 = E[y_0 - X\beta_0]^2 + \beta_0'X'M_pX\beta_0$$
prevents the $p$-model from dominating the complete one. Therefore the null hypothesis
of interest is
$$H_0 : E[y_0 - X_p\beta_p^*]^2 = E[y_0 - X\beta_0]^2$$

¹⁰In the general case, we get from (2) that $\tilde\sigma_p^2$ is a biased but consistent estimator of $\sigma_p^{*2}$.
¹¹From now on, we omit indices in expectations.

and the only alternative is
$$H_2 : E[y_0 - X_p\beta_p^*]^2 > E[y_0 - X\beta_0]^2.$$
If the testing procedure does not reject $H_0$, we will by convention retain the $p$-model;
that is, we adopt a parsimony principle for discrimination. By Equation (4),
we obtain
$$H_0 \iff \beta_0'X'M_pX\beta_0 = 0.$$
This hypothesis can be tested via the least-squares estimator of $\beta$ in the general
model by the statistic
$$F = \frac{(RSS_p - RSS_k)/r}{RSS_k/(n-k)},$$
which under Model 0 has a noncentral Fisher distribution with $r$ $(= k - p)$ and $n - k$
degrees of freedom and with noncentrality parameter $\lambda = \beta_0'X'M_pX\beta_0/(2\sigma_0^2)$. As
$H_0$ is equivalent to $\lambda = 0$, the statistic is compared to a central $F(k - p, n - k)$,
which is the standard Fisher test.
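A sketch of the procedure (ours; the simulated data, names and the approximate 5% critical value of 3.09 for $F(2, 96)$ are assumptions of the example, not computed here):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, p = 100, 4, 2
r = k - p
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta0 = np.array([1.0, 2.0, 0.0, 0.0])   # the r excluded coefficients are truly zero
y = X @ beta0 + rng.normal(size=n)

def rss(cols):
    Xp = X[:, list(cols)]
    bhat = np.linalg.lstsq(Xp, y, rcond=None)[0]
    e = y - Xp @ bhat
    return e @ e

# F = [(RSS_p - RSS_k)/r] / [RSS_k/(n - k)], central F(r, n - k) under H0
F = ((rss(range(p)) - rss(range(k))) / r) / (rss(range(k)) / (n - k))
critical = 3.09   # approximate 5% critical value of F(2, 96) (assumed, not computed)
keep_p_model = F <= critical   # parsimony: retain the p-model unless H0 is rejected
```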

Links with selection criteria Discriminating between Model 0 and a $p$-model
by any of the criteria discussed in the previous section is actually equivalent to a
test based on the Fisher statistic $F$. The corresponding critical values $F(\cdot)$ vary
with the sample size, but can be ordered as $F(\bar R^2) < F(PC) < F(C_p) < F(S_p)$;
see e.g. Amemiya (1980). This shows that some of these criteria favour parsimony
more than others. These remarks are however of limited scope, as they rely on a
hypothesis testing procedure for two nested models only.
More generally, one may wonder if sequential use of hypothesis testing can help
in choosing a best subset. One advantage is to restrict the number of models considered.
Indeed, with k regressors in the general model, there exist 2^(k−1) possible
models (when by convention the constant term is included in every model). To
avoid such a tedious examination, stepwise methods have been proposed, where
inclusion or deletion of a variable is based on the values of a battery of F-statistics.
The most popular ones include (i) forward selection, which starts with the constant
term and adds one variable at a time, (ii) backward elimination, which starts with
the general model and eliminates one variable at a time, and (iii) Efroymson's
procedure (1960), which basically is a forward selection where at each step the
possibility of deleting a variable as in backward elimination is considered. These
and other variants are reviewed in Hocking (1976), Thompson (1978) and in Miller's
monograph (1990). Nothing unfortunately ensures that the optimal subset, or even
the relevant variables, will be selected by means of one of these methods.12 But in
practice, stepwise methods may prove useful when the sample size is relatively small
compared with the number of potential regressors.
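As an illustration of how such a battery of F-statistics drives a stepwise search, here is a minimal forward-selection sketch. It is our own construction, not a procedure from the paper: the entry threshold f_in = 4.0 is an arbitrary conventional choice, and the data are simulated.

```python
import numpy as np

def rss(y, Z):
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return np.sum((y - Z @ b) ** 2)

def forward_selection(y, X, f_in=4.0):
    """Forward selection: start from the constant term (column 0) and, at
    each step, add the candidate regressor with the largest partial F
    statistic, stopping when no candidate reaches the threshold f_in."""
    n = len(y)
    selected = [0]
    remaining = list(range(1, X.shape[1]))
    while remaining:
        rss_cur = rss(y, X[:, selected])
        best_f, best_j = -np.inf, None
        for j in remaining:
            rss_j = rss(y, X[:, selected + [j]])
            df = n - len(selected) - 1      # residual df of the augmented model
            f = (rss_cur - rss_j) / (rss_j / df)
            if f > best_f:
                best_f, best_j = f, j
        if best_f < f_in:
            break
        selected.append(best_j)
        remaining.remove(best_j)
    return sorted(selected)

# Simulated data: columns 1 and 2 relevant, columns 3 and 4 pure noise.
rng = np.random.default_rng(0)
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
y = X @ np.array([1.0, 2.0, -1.5, 0.0, 0.0]) + rng.normal(size=n)
chosen = forward_selection(y, X)
```

With strong signals, the relevant columns are picked up; as the text warns, nothing guarantees the noise columns are always excluded.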

Wallace's test If we consider now the overall expected discrepancy, the null hypothesis
of interest is

H0 : E[y0 − Xpβ̂p]² ≤ E[y0 − Xβ̂]²

against

Ha : E[y0 − Xpβ̂p]² > E[y0 − Xβ̂]².

Now, because the overall expected discrepancy of any estimator b of β0 is the sum
of σ0² and of E[(b − β0)′X′X(b − β0)], the null hypothesis is equivalent to

E[(β̃p − β0)′X′X(β̃p − β0)] ≤ E[(β̂ − β0)′X′X(β̂ − β0)],

where β̃p = (β̂p′, 0′)′ and the regressors have been suitably ordered. Thus, the null
hypothesis is that the restricted estimator is better in weak mean squared error than
the unrestricted one, as termed by Wallace (1972). He shows that a necessary and
sufficient condition for H0 to hold is that λ ≤ r/2. It follows that H0 can be tested
by means of the F-statistic. Goodnight and Wallace (1972) provide tables for such a
test.

A test based on a different discrepancy The following testing procedure
provides an example where the discrepancy criterion is not directly related to the Gauss
discrepancy, though the estimators considered (least-squares estimators) are.
Toro-Vizcarrondo and Wallace (1968) argue that in some situations one may
want to use a constrained estimator even when the restriction is not valid. A constrained
estimator has indeed a smaller variance, and one might be willing to make
a trade-off, accepting some bias in order to reduce variances. The possibility of
trade-off can be captured via the concept of mean squared error, so that we prefer
the restricted estimator β̃p if

12 In their defense, it should be noted that optimality was never claimed by the originators of
stepwise methods.
MSE(l′β̃p) ≤ MSE(l′β̂) for every nonrandom vector l ≠ 0.

This is equivalent to the requirement that the matrix

E[(β̂ − β0)(β̂ − β0)′] − E[(β̃p − β0)(β̃p − β0)′]

should be positive semi-definite. Toro-Vizcarrondo and Wallace show that this holds
if and only if λ ≤ 1/2, so that again a test can be based on the classical F statistic,
with critical values given in Wallace and Toro-Vizcarrondo (1969).

3.2.2 Nonnested models

Consider now two linear nonnested models defined as

Model 1 : y = X1β1* + u1 and Model 2 : y = X2β2* + u2.

The regressor sets may have some common variables W plus some idiosyncratic
ones Z1 and Z2, so that X1 = (W, Z1) and X2 = (W, Z2). This situation may arise
for instance when the practitioner can use two different proxies for the same
unobservable variable.

Hotelling's approach Comparing the discrepancy due to approximation associated
with each model leads one to consider

H0 : E[y0 − X1β1*]² = E[y0 − X2β2*]².

The null hypothesis can actually be interpreted as saying that the expectation of
y0 (i.e. of y) lies at an equal distance from the space spanned by X1 and the space
spanned by X2. Hotelling (1940) was the first to propose a testing procedure for
H0 in the case of linear univariate regression models. Chow (1980) generalizes this
procedure to the case of multidimensional X1 and X2 which are disjoint (apart from
the constant term). From (4), we have the equivalence

H0 ⇔ β*′X′[M1 − M2]Xβ* = 0,

where X = (W, Z1, Z2) is the union of X1 and X2 and Mi = I − Xi(Xi′Xi)⁻¹Xi′, i =
1, 2. Chow notes that this quadratic constraint is equivalent to the linear constraint
β̂′X′[M1 − M2]Xβ = 0 (to a second-order approximation), where β̂ is the least-squares
estimator of β*. This can be tested via an F-test in the comprehensive model.
Namely, if β̃ is the constrained estimator of β* under the above linear constraint,
then the statistic

F = ‖Xβ̂ − Xβ̃‖² / [‖y − Xβ̂‖²/(n − k)]

is approximately distributed as a central F(1, n − k) under the null hypothesis.

Efron's test Comparing the overall expected discrepancy associated with each
model leads to the null hypothesis

H0 : E[y0 − X1β̂1]² = E[y0 − X2β̂2]².

Efron (1984) built a testing procedure from the criteria (3) and a nonparametric
bootstrap procedure that approximates the distribution of their difference. He relates
his approach to Hotelling's in the univariate case and also gives a geometrical
interpretation of the comparison between discrepancies.
Efron's test, as well as Hotelling's procedure, does not require any of the competing
models to be correctly specified, i.e. to include the unknown DGP. Lien
and Vuong (1987) propose a testing procedure based on the likelihood ratio for
nonnested linear normal models that also allows for misspecification of both models
under consideration.

4 Selection of regressors in nonlinear models

4.1 Discrimination by the Gauss discrepancy

Let us consider a family of parameterized approximating models, relating a random
variable Y to some explanatory variables X of dimension k, and defined as

Y = r(X, β) + U, E[U|X] = 0, var[U|X] = σ². (5)

The simplest example is given by

r(X, β) = X′β,

so that imposing nullity of some parameters amounts to selecting some subset of X.
In other instances, the framework may be more general than a choice-of-variables
problem. The variables in X can be either fixed or random. The Gauss discrepancy
writes

E_F0[Y − r(X, β)]².

The discrepancy due to approximation is

E_F0[Y − r(X, β*)]² = min{E_F0[Y − r(X, β)]² ; β},

and the expected overall discrepancy is

E_F0[Y − r(X, β̂)]².

With i.i.d. observations {(yi, Xi), i = 1, ..., n} from the process (Y, X), the minimum
empirical Gauss discrepancy estimator of β is

β̂ = arg min_β (1/n) Σi [yi − r(Xi, β)]²,

the nonlinear least-squares estimator.13


The DGP is assumed to be of the form

Y = r0(X) + U0, E[U0|X] = 0, var[U0|X] = σ0².

Hence, the DGP is not generally assumed to be one of the approximating models.
The expected overall discrepancy then writes

MSEP = E_F0[Y − r(X, β̂)]² = E_F0[Y − r0(X)]² + E_F0[r0(X) − r(X, β̂)]²
     = σ0² + E_F0[r0(X) − r(X, β̂)]². (6)

In a nonlinear context, it is not easy to go further, even when assuming a specific
parametric form for r0(·). Two methods are then particularly useful: cross-validation
schemes and bootstrap procedures. These methods, though, were initially
proposed for linear models.

Sample reuse methods Maybe the oldest idea found in the literature is to split
the sample in two subsamples, using the first for estimation and the second for
model selection. A refinement is then introduced as the leave-one-out method: one
computes β̂−i on the (n − 1)-sample obtained by omitting observation i and computes
the associated predicted residual, i.e. yi minus its predicted value estimated by
r(Xi, β̂−i). The cross-validation criterion is just the corresponding mean of squared
predicted residuals

CV = (1/n) Σi [yi − r(Xi, β̂−i)]² = (1/n) PRESS.

Minimizing PRESS (Prediction Sum of Squares) or CV across competing models
was first proposed by Allen (1974) and Stone (1974) for linear models.

13 For the sake of simplicity, we do not explicitly consider heteroscedastic models and weighted
nonlinear least-squares.
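For linear least squares, the predicted residuals need not be computed by n separate fits: they equal ei/(1 − hii), where hii is the i-th diagonal element of the hat matrix. The sketch below (our own check, on simulated data) verifies this shortcut against the brute-force leave-one-out computation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# Brute-force leave-one-out: n separate least-squares fits.
press = 0.0
for i in range(n):
    keep = np.arange(n) != i
    b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    press += (y[i] - X[i] @ b) ** 2
cv_brute = press / n

# Shortcut valid for least squares: predicted residual = e_i / (1 - h_ii).
H = X @ np.linalg.solve(X.T @ X, X.T)    # hat matrix
e = y - H @ y
cv_short = np.mean((e / (1.0 - np.diag(H))) ** 2)
```

Both routes give the same CV value, which is why PRESS is cheap to evaluate in the linear case.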

More elaborate cross-validation schemes can be considered. Some authors, see
e.g. Geisser (1975) and Burman (1989), propose to use a delete-d cross-validation
procedure, which is a straightforward generalization of Allen's delete-one cross-validation.
The validity of the criterion relies on the fact that the delete-d CV
is an unbiased estimator of the MSEP for a sample of size n − d, see e.g. Linhart
and Zucchini (1986).

Another well-known method is the generalized cross-validation, introduced by
Craven and Wahba (1979). It is actually valid for any linear estimator of the regression,
i.e. of the form [r(X1, β̂), ..., r(Xn, β̂)]′ = H(y1, ..., yn)′, and writes

GCV = (1/n) Σi [yi − r(Xi, β̂)]² / [(1/n) tr(I − H)]².

For a linear p-model, H = Pp and tr(I − H) = n − p, so that GCV is almost identical
to nSp.
A noteworthy advantage of cross-validation devices is that they allow one to
estimate the prediction error without particular assumptions on the unknown DGP.
This is particularly valuable, even with respect to classical criteria in the linear
model.
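The GCV formula is straightforward to compute once the smoother matrix H is available. The sketch below (our own illustration on simulated data) evaluates it for an OLS fit and checks the closed form n·RSS/(n − p)² implied by tr(I − H) = n − p.

```python
import numpy as np

def gcv(y, X):
    """Generalized cross-validation for a least-squares fit, written in the
    generic linear-smoother form y_hat = H y."""
    n = X.shape[0]
    H = X @ np.linalg.solve(X.T @ X, X.T)
    resid = y - H @ y
    return np.mean(resid ** 2) / (np.trace(np.eye(n) - H) / n) ** 2

rng = np.random.default_rng(2)
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
rss_val = np.sum((y - X @ b) ** 2)
closed_form = n * rss_val / (n - p) ** 2   # since tr(I - H) = n - p for OLS
```

The generic form is what makes GCV applicable to any linear smoother, not just OLS projections.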

Asymptotic optimality: case of a fixed DGP Let us assume that the DGP
is among the considered approximating models, with corresponding value β0. For
simplicity of exposition, we consider selection of variables among models of the form
(5) and distinguish two categories of models. Category I gathers models such that
at least one nonzero component of β0 is not in β, while models in Category II are
such that β contains all nonzero components of β0. The models in Category I are
incorrect models, while some models in Category II may be inefficient because of
their unnecessarily large sizes. It is desirable that the selection rule asymptotically
selects an optimal model, a model with minimal dimension in Category II.
In terms of MSEP, models in Category I differ from those in Category II by
at least E_F0[r0(X) − r(X, β*)]², which under suitable conditions tends to a positive
limit. Hence the CV criterion, as a consistent estimator of MSEP, has an associated
probability of choosing an incorrect model that tends to zero. However, models
in Category II asymptotically all have an MSEP independent of the dimension of the
regressor set and equal to σ0². Hence, in this setting, use of CV is not an asymptotically
optimal rule, as it tends to choose larger models with positive asymptotic
probability, see Shao (1993) and Zhang (1993). More precisely, these rules are optimal
only if the DGP coincides with the larger model, as in this case Category II is
a singleton.
Let us explain further why the CV rule is not optimal. The delete-d CV
estimates the MSEP based on a sample of size (n − d), rather than n. Now what is
central for selection is to accurately estimate the differences in MSEP across models.
Between models of different categories, this is not difficult, as their difference is an
O(1), because of a bias term in the second term of (6). However, among models
in Category II, this difference is an O((n − d)⁻¹), for the second term in (6) is a
variance term, see e.g. Equation (1). But the error in estimation is typically of
order n⁻¹, and the usual CV rule with fixed d therefore fails to distinguish between
models in Category II. Shao (1993) shows that the inconsistency of the leave-one-out
cross-validation can be rectified by using a leave-d-out cross-validation with d, the
number of observations reserved for validation, satisfying d/n → 1 and n − d → ∞
as n → ∞. In this way, one is estimating the MSEP based on a sample of size
(n − d), with error in estimation of order n⁻¹, which is asymptotically smaller
than the order of differences in MSEP for Category II's models, i.e. (n − d)⁻¹.
Practically, this cross-validation scheme can be performed for a sufficiently large
but incomplete collection of subsets of size d: this is called the balanced incomplete
CV. Alternatively, Shao proposes to reduce the computational burden involved in
the procedure by using a Monte-Carlo CV. The balanced incomplete CV method
remains asymptotically consistent when the fixed DGP is not part of the considered
approximating models. Moreover, the method can in principle be used in backward
or forward selection algorithms.
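A Monte-Carlo CV in the spirit of Shao's prescription d/n → 1 can be sketched as follows. This is our own toy simulation (sample sizes, number of splits, and the candidate models are arbitrary choices); it illustrates how reserving most of the sample for validation penalizes an overfitted model.

```python
import numpy as np

def mc_cv(y, X, cols, d, splits, rng):
    """Monte-Carlo cross-validation: average validation error over random
    splits that each reserve d observations for validation."""
    n = len(y)
    total = 0.0
    for _ in range(splits):
        val = rng.choice(n, size=d, replace=False)
        train = np.setdiff1d(np.arange(n), val)
        b, *_ = np.linalg.lstsq(X[np.ix_(train, cols)], y[train], rcond=None)
        total += np.mean((y[val] - X[np.ix_(val, cols)] @ b) ** 2)
    return total / splits

rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 6))])
y = X[:, :2] @ np.array([1.0, 1.5]) + rng.normal(size=n)   # only 2 columns matter

d = n - 25                     # d/n close to one, as Shao's asymptotics suggest
cv_small = mc_cv(y, X, [0, 1], d, 300, rng)            # correct small model
cv_large = mc_cv(y, X, list(range(7)), d, 300, rng)    # overfitted model
```

With only n − d = 25 observations for estimation, the variance cost of the five superfluous coefficients shows up clearly in the validation error.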

Asymptotic optimality: case of a varying DGP Stone (1979) notes that,
though the usual form of asymptotic study is to let n go to infinity in a fixed
DGP, it may be more realistic to consider more complex DGPs as the sample size
grows. Asymptotic optimality of model selection criteria is critically affected when
this kind of asymptotic analysis is adopted, for when the dimension of the DGP
(i.e. the dimension of X) is infinite or increases with the sample size, the models under
consideration are all just approximations to the true DGP.
In the context of normal linear models, Shibata (1981) shows that PC and Sp
are optimal. Li (1987) further proves the asymptotic equivalence of Cp rules and cross-validation
rules (CV and GCV) for homoscedastic nonlinear models. As shown
by Andrews (1991), the asymptotic optimality of CV extends to heteroscedastic
models, but this is not the case for GCV.
Breiman and Freedman (1983) also consider a setup where the DGP is a linear
model with an infinity of ordered regressors and show that use of Sp asymptotically
leads to retaining the subset minimizing E_X[MSEP].

Bootstrap methods in selection Bootstrap methods provide a simple and effective
means of circumventing the problems encountered in deriving an estimator of
MSEP. In our context, there are mainly two ways of generating bootstrap samples:

• Unconditional bootstrap: Generate bootstrap samples {(Xi*, yi*), i = 1, ..., n}
from the empirical distribution of the original pairs {(Xi, yi), i = 1, ..., n}.

• Conditional (or residual) bootstrap: Let ûi be the i-th residual from the estimated
full model. Generate {ui*, i = 1, ..., n} from the distribution that puts
mass n⁻¹ on each ûi/√(1 − k/n), i = 1, ..., n. Consider bootstrap samples
{(Xi* = Xi, yi* = r(Xi, β̂0) + ui*), i = 1, ..., n}, where β̂0 is the estimator of
β0 in the full correct model.

The unconditional (or paired) bootstrap seems more appropriate for random regressors,
though it can also be used in the deterministic case. Let β̂* be the least-squares
estimator on each bootstrap sample. Efron (1993) derives a bootstrap estimate of
MSEP as

RSS/n + E*{(1/n) Σi [yi − r(Xi, β̂*)]² − (1/n) Σi [yi* − r(Xi*, β̂*)]²},

where RSS = Σi [yi − r(Xi, β̂)]² and E* is the expectation with respect to bootstrap
sampling. This estimator is almost unbiased, but the associated rule is equivalent
to a leave-one-out cross-validation procedure, so that it is too conservative in
the fixed DGP case for the reasons explained above.
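The optimism-correction idea behind this bootstrap estimate can be sketched for the paired bootstrap as follows. This is our own minimal illustration on simulated linear data, not Efron's exact construction: the apparent error is corrected by the average excess of the original-data error over the bootstrap-data error.

```python
import numpy as np

def bootstrap_msep(y, X, B, rng):
    """Paired-bootstrap estimate of prediction error: apparent error plus
    an averaged 'optimism' correction, in the spirit of Efron's estimator."""
    n = len(y)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    apparent = np.mean((y - X @ b) ** 2)
    optimism = 0.0
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                  # resample pairs
        bb, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        err_orig = np.mean((y - X @ bb) ** 2)             # bootstrap fit, original data
        err_boot = np.mean((y[idx] - X[idx] @ bb) ** 2)   # bootstrap fit, bootstrap data
        optimism += err_orig - err_boot
    return apparent, apparent + optimism / B

rng = np.random.default_rng(4)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 1.0, 0.5]) + rng.normal(size=n)
apparent, msep_hat = bootstrap_msep(y, X, B=200, rng=rng)
```

The corrected estimate exceeds the apparent (in-sample) error, reflecting the overfitting of the least-squares fit to its own sample.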
Shao (1996) shows the asymptotic optimality of a modified bootstrap selection
procedure where one generates bootstrap samples of size m, with m → ∞ and
m/n → 0 as n → ∞. The simpler estimator

E*{(1/n) Σi [yi − r(Xi, β̂m*)]²},

where β̂m* is computed from a bootstrap sample of size m, is minimized over the
competing models. For the conditional bootstrap, the method may not be easy to
implement unless there is a special structure in the regressor values. The modified
procedure then consists in bootstrapping n residuals from a distribution that puts
mass n⁻¹ on each √(n/m) ûi, i = 1, ..., n. As in
the cross-validation scheme of Shao (1993), these bootstrap schemes allow one to get an
error in estimation of MSEP that is of smaller order than the differences in MSEP.
The choice of m in the bootstrap procedure (as well as the choice of d in the balanced
incomplete CV) clearly needs to be investigated further.
Breiman (1992, 1996) proposes a different method. Building on the fact that

E_F0[y − r(X, β̂)]² = σ0² + E_F0[r0(X) − r(X, β̂)]² + 2 E_F0[U0 (r0(X) − r(X, β̂))],

an estimator of MSEP can be constructed as

(1/n) Σi [yi − r(Xi, β̂)]² + 2 B̂p,

where B̂p is a suitable estimator of E_F0[U0 r(X, β̂)].14 The following bootstrap
method is then applied. Generate {εi*, i = 1, ..., n} as an i.i.d. sample from N(0, t²σ̂0²),
where σ̂0² comes from the estimation of the complete model (assumed to be correct)
and t is chosen to be "small" (between 0.6 and 1). Consider bootstrap samples
{(Xi, yi* = yi + εi*), i = 1, ..., n}; from these, compute r̂* = [r(X1, β̂*), ..., r(Xn, β̂*)]′
and

B̂p = (1/(n t²)) E*[ε*′ r̂*].

The procedure is labelled the little bootstrap. When the limit of B̂p as t → 0 exists,
it is called the tiny bootstrap. Breiman (1992, 1996) investigates the properties of
these procedures in several Monte-Carlo experiments (see also Breiman and Spector,
1992, for the X-random case).

4.2 Discrimination based on the Kullback-Leibler discrepancy

We now consider the choice among models defined by parameterized conditional
densities. The approximating families Gθ are indexed by a generic density g(y|x, θ).
The discrepancy most frequently associated with this setting is the Kullback-Leibler
discrepancy, formally defined as

E_f0[log(f0(y|x)/g(y|x, θ))],

where f0 characterizes the unknown DGP. We can consider the equivalent monotonic
transformation E_f0[log g(y|x, θ)]. The discrepancy due to approximation then writes

E_f0[log g(y|x, θ*)] = max{E_f0[log g(y|x, θ)] ; θ ∈ Θ}

and the overall expected discrepancy writes E_f0[log g(y|x, θ̂)].14 With i.i.d. observations
{(yi, Xi), i = 1, ..., n} from the process (Y, X), the corresponding minimum
discrepancy estimator is the maximum likelihood (ML) estimator,

θ̂ = arg max{Ln(θ) ; θ ∈ Θ}, where Ln(θ) = Σi log g(yi|Xi, θ).

It is useful for what follows to recall the asymptotic properties of the ML estimator,
see White (1982). Its probability limit is called the pseudo-true value of the parameter
θ, defined as θ* = arg max_θ E_f0[log g(y|x, θ)]. The asymptotic distribution of θ̂
is given by

√n(θ̂ − θ*) →d N(0, Ω⁻¹(θ*)C(θ*)Ω⁻¹(θ*)), (7)

where

Ω(θ) = −E_f0[∂² log g(y|x, θ)/∂θ∂θ′] and C(θ) = E_f0[(∂ log g(y|x, θ)/∂θ)(∂ log g(y|x, θ)/∂θ)′].

When the DGP is included in the approximating family, i.e. f0 = g(y|x, θ0) for some
θ0, then θ* = θ0 and the information matrix equivalence Ω(θ0) = C(θ0) holds, so
that √n(θ̂ − θ0) →d N(0, Ω⁻¹(θ0)).

14 Note the similarity with Cp's construction.

4.2.1 Likelihood-based criteria

In nonlinear contexts, asymptotic properties are in general the only ones available.
The definition of criteria will then naturally rely on asymptotic approximations. Under
standard regularity conditions, Linhart and Zucchini (1986) show (see our Appendix
for a sketch of the proof) that

E[log g(y|x, θ*)] ≈ E[n⁻¹Ln(θ̂)] − (1/2) n⁻¹ tr Ω⁻¹(θ*)C(θ*)
and                                                             (8)
E[log g(y|x, θ̂)] ≈ E[n⁻¹Ln(θ̂)] − n⁻¹ tr Ω⁻¹(θ*)C(θ*).

Well-known criteria are subsequently derived by replacing the unknown expectations
with their sample analogs.

Akaike's Information Criterion Assume that the approximating model contains
the DGP, that is g(y|x, θ*) = f0. Thus the information matrix equivalence
holds, so that tr Ω⁻¹(θ*)C(θ*) = tr Ip = p, and a possible criterion for E_f0[log g(y|x, θ̂)]
is given by (1/n)Ln(θ̂) − (p/n). Akaike's (1973) Information Criterion is defined as
minus twice the latter quantity, i.e.

AIC = −(2/n)Ln(θ̂) + 2(p/n).

It reduces (up to a monotonic transformation) to (RSSp/n) exp(2p/n) for the standard
linear model. Similarly to the σ̂² or PC criteria, Akaike's criterion is optimistic, as
it is derived under the assumption that the approximating model under consideration
is correctly specified.15
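For a quick numerical illustration, the linear-normal form of AIC can be computed as log(RSSp/n) + 2p/n, the logarithm of the expression above. The sketch below is our own simulated example (three relevant regressors out of five candidates) and scores nested models of increasing size.

```python
import numpy as np

def aic_linear(y, X):
    """AIC for a linear model with normal errors, up to an additive
    constant: log(RSS_p / n) + 2 p / n."""
    n, p = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss_p = np.sum((y - X @ b) ** 2)
    return np.log(rss_p / n) + 2 * p / n

rng = np.random.default_rng(5)
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
y = X[:, :3] @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

scores = {p: aic_linear(y, X[:, :p]) for p in range(1, 6)}
best_p = min(scores, key=scores.get)
```

With strong coefficients, models omitting a relevant regressor are heavily penalized through the residual variance; as the surrounding discussion notes, AIC may still retain superfluous regressors.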

Sawa's Information Criterion For estimating E_f0[log g(y|x, θ̂)], it can be possible
to compute the value of tr Ω⁻¹(θ*)C(θ*) if one assumes some particular form for
the DGP. Consider for instance the linear Model 0 of Section 2 and a linear normal
approximating p-model: the trace can then be expressed in terms of σ0² and of the
residual variance σp*² of the p-model. Sawa (1978) replaces the unknown residual
variances σ0² and σp*² by their ML estimators σ̂0² and σ̂p² and gets his Information
Criterion, Sawa's BIC.

Chow's extension Chow (1981) treats the general case where the information
matrix equivalence does not hold. As Ω(θ*) and C(θ*) can be estimated by their
sample analogs Ω̂(θ̂) and Ĉ(θ̂), without any parametric assumption on the DGP, he
proposes the criterion

(1/n)Ln(θ̂) − (1/n) tr Ω̂⁻¹(θ̂)Ĉ(θ̂),

to estimate E_f0[log g(y|x, θ̂)]. In the case of linear normal models, a monotonic
transformation of Chow's criterion gives Sawa's BIC.

Other criteria Though based on different methods than the prediction-based criteria
for linear models, likelihood-based criteria are similarly built. A term is "added" to
the objective function used to estimate the parameters of the model so as to penalize
model complexity. Similarly, other criteria could be built through approximations
of E_f0[log g(y|x, θ*)]. From Equations (8), it is easily seen that they would write
as the sum of Ln(θ̂) plus a penalty term, which is half the penalty term attached
to the preceding criteria. Thus such criteria would penalize complexity less than
the classical ones.

15 Akaike (1973) actually imposes that the estimated model is not too far from the unknown
DGP, by means of a condition that is quite ambiguous. For a detailed derivation of AIC under the
assumption of a correct specification, see Amemiya (1980).

Sample reuse approaches are also suggested by Geisser and Eddy (1979), but
have been little investigated in the literature. The leave-one-out criterion is shown
to be asymptotically equivalent to AIC by Stone (1977) under the assumption of a
fixed DGP.

Asymptotic optimality Information criteria have first been studied in the context
of time series, and specifically for determining the order of an autoregression. In the
context of a fixed DGP, Shibata (1976) shows that AIC tends to overestimate the
number of lags, unless the DGP corresponds to the maximum order. Alternatively,
Shibata (1980) shows optimality of AIC when the order of the autoregression is
infinite.16 For a fixed DGP, Hannan and Quinn (1979) show that (almost sure)
optimality is reached for the modified information criteria

log σ̂p² + 2 (p/n) c log log n,

for any c > 1, where σ̂p² is the ML variance estimator in the autoregression of order
p (the first term corresponds to −(2/n)Ln(θ̂p)).
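The Hannan-Quinn rule is easy to sketch for autoregression order selection. The code below is our own illustration: the constant c = 1.5 and all sample sizes are arbitrary choices, and the AR fits use simple least squares on lagged values (so the effective samples differ slightly across candidate orders).

```python
import numpy as np

def hq_order(x, p_max, c=1.5):
    """Hannan-Quinn order selection for an autoregression: minimize
    log(sigma_p^2) + 2 (p/n) c log log n over p = 0, ..., p_max."""
    n = len(x)
    best_p, best_val = 0, np.log(np.mean((x - x.mean()) ** 2))  # p = 0 baseline
    for p in range(1, p_max + 1):
        Y = x[p:]
        Z = np.column_stack([np.ones(n - p)] +
                            [x[p - j:n - j] for j in range(1, p + 1)])  # lags 1..p
        b, *_ = np.linalg.lstsq(Z, Y, rcond=None)
        sig2 = np.mean((Y - Z @ b) ** 2)
        val = np.log(sig2) + 2 * (p / n) * c * np.log(np.log(n))
        if val < best_val:
            best_p, best_val = p, val
    return best_p

# Simulated AR(2): x_t = 0.5 x_{t-1} - 0.3 x_{t-2} + e_t
rng = np.random.default_rng(6)
n = 600
x = np.zeros(n)
e = rng.normal(size=n)
for t in range(2, n):
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + e[t]
order = hq_order(x, p_max=6)
```

The log log n penalty is just strong enough for strong consistency, which is the point of Hannan and Quinn's result.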
In the context of independent sampling, Nishii (1988) copes with a situation
where any model under consideration may be misspecified. He studies the optimality
of selection procedures based on penalized likelihood criteria of the form

Ln(θ̂) − cn p,

where p is the dimension of the model. He shows that the penalty cn attached to the
dimension of the model should satisfy

lim_{n→∞} cn/n = 0 and lim_{n→∞} cn/log log n = +∞

to ensure almost sure consistency, while

lim_{n→∞} cn/n = 0 and lim_{n→∞} cn = +∞

is sufficient for weak consistency. Sin and White (1996) generalize this work in
several directions and weaken the conditions found by Nishii. In particular, they
consider dependent and heterogeneous processes. For weak consistency, the conditions
are similar. For strong consistency between two models, when assuming that
one is correct, one can exploit the information matrix equivalence and recover
Hannan and Quinn's result. Otherwise, the conditions on the penalty term generally
vary depending on the two models being compared. The penalty term is then
interpreted as corresponding to a critical value in a testing procedure.

16 These results are analogous to the ones reported in Section 4.1.

4.2.2 Hypothesis testing procedures

Tests for nested models Consider two approximating models G1 and G2 such
that G1 ⊂ G2. Within the discrimination framework, a hypothesis testing procedure
aims to compare the discrepancy measures associated with the two approximating
models. By the properties of a discrepancy, the discrepancy due to approximation
cannot be greater for G2, so that we shall test the null hypothesis

H0 : E_f0[log g1(y|x, θ1*)] = E_f0[log g2(y|x, θ2*)].

It turns out that the null hypothesis is equivalent to g1(y|x, θ1*) = g2(y|x, θ2*), see
Vuong (1989a). Moreover, it often reduces to some linear or nonlinear restrictions
on the parameters, of the form H0 : c(θ2*) = 0.
One of the most popular methods for testing such a hypothesis is the Wald
test, based on the statistic

W = n c(θ̂2)′ [Ĉ V̂(θ̂2) Ĉ′]⁻¹ c(θ̂2), with Ĉ = ∂c(θ̂2)/∂θ′,

where V̂(θ̂2) is an estimator of the asymptotic variance of θ̂2. If one assumes further
that G2 contains the DGP, so that g2(y|x, θ2*) = f0, then one can use V̂(θ̂2) = Ω̂⁻¹(θ̂2).
The resulting statistic W is asymptotically distributed as a central χ² with r degrees
of freedom under H0. Other procedures include the Lagrange multiplier test and
the Hausman-Wald test, see e.g. White (1994). They are asymptotically equivalent
to the Wald test under the null hypothesis.
However, it is possible to build a test for H0 without the optimistic view that the
largest approximating model contains the DGP. Specifically, the asymptotic variance
of θ̂ is given by (7). White (1982) derives robust versions of the Wald and Lagrange
multiplier tests, which use the variance estimator V̂(θ̂) = Ω̂⁻¹(θ̂)Ĉ(θ̂)Ω̂⁻¹(θ̂). These
are robust to misspecification in the sense that it is not assumed that the approximating
models are correctly specified. As their classical analogs, robust tests are
asymptotically equivalent and distributed as a χ²(r) under H0.
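The sandwich variance Ω̂⁻¹ĈΩ̂⁻¹ at the heart of these robust tests can be sketched in the quasi-ML special case of least squares, where it reduces to the familiar heteroscedasticity-robust covariance. The code below is our own illustration on simulated data with a simple zero restriction; the restriction matrix and design are arbitrary choices.

```python
import numpy as np

def sandwich_cov(y, X):
    """Misspecification-robust (sandwich) covariance of the least-squares
    estimator: (X'X)^{-1} (sum_i e_i^2 x_i x_i') (X'X)^{-1}."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    bread = np.linalg.inv(X.T @ X)
    meat = (X * (e ** 2)[:, None]).T @ X
    return b, bread @ meat @ bread

def wald(b, V, R, r):
    """Wald statistic for H0: R b = r, with V the covariance of b."""
    d = R @ b - r
    return d @ np.linalg.solve(R @ V @ R.T, d)

rng = np.random.default_rng(7)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
# heteroscedastic errors: the variance depends on the second regressor
u = rng.normal(size=n) * (1.0 + 0.5 * np.abs(X[:, 1]))
y = X[:, :2] @ np.array([1.0, 2.0]) + u

b, V = sandwich_cov(y, X)
R = np.zeros((2, 4)); R[0, 2] = 1.0; R[1, 3] = 1.0   # H0: beta_2 = beta_3 = 0
W = wald(b, V, R, np.zeros(2))                        # approx. chi-square(2) under H0
```

Because the restriction is true in this simulation, the statistic stays in the bulk of its chi-square(2) reference distribution.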
These tests also share the particularity of not testing the initial hypothesis H0
itself, but some form equivalent to H0. The initial hypothesis of equal entropies can be
tested through the empirical likelihood ratio

LRn(θ̂1, θ̂2) = Ln²(θ̂2) − Ln¹(θ̂1).

In the simple case of no misspecification, the statistic 2 LRn(θ̂1, θ̂2) is asymptotically
distributed as a χ²(r) under H0, and is asymptotically equivalent to the classical
Wald statistic. However, for the case of two possibly misspecified approximating
models, this asymptotic equivalence does not generally hold. Specifically, Vuong
(1989a) shows that under H0, 2 LRn(θ̂1, θ̂2) converges in distribution to a weighted
sum of independent chi-squares. Moreover, a test based on this statistic is consistent
against Ha. It is only when the information matrix equivalence holds for the larger
model that the test statistic has the usual χ²(r) null asymptotic distribution.

Tests for nonnested models Before presenting testing procedures associated
with the discrimination framework, we will give an overview of specification and
encompassing tests against nonnested alternatives. This topic has been developed
much more than discrimination tests in the literature, and the related tests are very
well-known. As we argue, while specification and encompassing tests, which we label
non-discrimination tests, are often viewed as competitors to discrimination tests,
this view is particularly misleading.

Non-discrimination tests Two approximating models G1 and G2 are usually
said to be nonnested if neither of them corresponds to some restrictions on
the other one, that is, if G1 ⊄ G2 and G2 ⊄ G1.

Artificial nesting In the likelihood context, artificial nesting is defined through
the densities of each considered model. Quandt (1972) proposed to consider
an artificial mixture of the two models of generic form (1 − λ)g1(y|x, θ1) +
λg2(y|x, θ2), λ being the probability that an observation is generated by Model
2. Another framework, developed by Atkinson (1970), is based on a geometric
mean of the densities, i.e. a density proportional to [g1(y|x, θ1)]^λ [g2(y|x, θ2)]^(1−λ).
Practical problems arise in particular because the estimation procedure entails
a high-dimensional optimization problem. More importantly, some parameters
are not identified in the polar cases corresponding to λ = 1 and λ = 0.
For regression models, we can consider a simpler mixture model where the
regression function is a convex combination of the two initial regressions. Thus
with two regression models

M1 : y = r1(X1) + U1 and M2 : y = r2(X2) + U2,

we can associate the nesting regression model

y = λ r1(X1) + (1 − λ) r2(X2) + U.

While all parameters are not yet identifiable in this model, a specification test
of M1 can be entertained by a two-step procedure as proposed by Davidson and
MacKinnon (1981): first estimate Model M2, then test λ = 1 in the nesting
regression model where r2 has been replaced by its estimate. If the procedure
does not reject λ = 1, Model M1 will be validated.
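This two-step idea is what Davidson and MacKinnon's J test implements in its standard formulation: augment M1 with the fitted values of M2 and apply a t-test to their coefficient. The sketch below is our own simulated illustration of that formulation (names, sizes, and the linear specifications are arbitrary choices).

```python
import numpy as np

def j_test(y, X1, X2):
    """Davidson-MacKinnon J test of M1 against nonnested M2: augment M1
    with the fitted values of M2 and return the t statistic on them."""
    b2, *_ = np.linalg.lstsq(X2, y, rcond=None)
    Z = np.column_stack([X1, X2 @ b2])       # M1 regressors plus M2 fitted values
    n, q = Z.shape
    bz, *_ = np.linalg.lstsq(Z, y, rcond=None)
    e = y - Z @ bz
    s2 = e @ e / (n - q)
    V = s2 * np.linalg.inv(Z.T @ Z)
    return bz[-1] / np.sqrt(V[-1, -1])

rng = np.random.default_rng(8)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X1 = np.column_stack([np.ones(n), x1])
X2 = np.column_stack([np.ones(n), x2])
y = 1.0 + 2.0 * x2 + rng.normal(size=n)   # the DGP is model M2

t_m1 = j_test(y, X1, X2)   # testing the false model M1: |t| large
t_m2 = j_test(y, X2, X1)   # testing the true model M2: |t| moderate
```

Note the asymmetry discussed below: each direction of the test treats one model as the null, so discriminating between the two requires running the pair of tests.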

Cox's test Another general specification test for testing M1 relies on the modified
likelihood ratio

Ln¹(θ̂1) − Ln²(θ̂2) − n plim n⁻¹[Ln¹(θ̂1) − Ln²(θ̂2)],

where θ̂2 is an estimator of θ2 and where the probability limit is evaluated under
Model M1. Substitution of any locally √n-consistent estimator of θ2 under
the null hypothesis yields an asymptotically normal statistic under the null
hypothesis of the correct specification of M1. Cox (1961, 1962) uses the usual
ML estimate.17 Atkinson (1970) uses an estimate of the pseudo-true value of
θ2 under estimated Model 1, where the pseudo-true value of θ2 associated with
θ1 is

θ2*(θ1) = arg max_{θ2} E_{g1(y|x, θ1)}[log g2(y|x, θ2)].
Encompassing Cox's procedure is a generalization of the likelihood ratio test. It
is also possible to extend other classical tests to the nonnested case by the
encompassing principle, which states that the chosen model should explain the
results obtained by competing models, see Hendry and Richard (1982). The
encompassing Wald test, for instance, considers the difference between the
ML estimate θ̂2 and θ2*(θ̂1), the estimated pseudo-true value of θ2 associated
with θ̂1. The encompassing principle is fairly general and can be applied to
nonnested situations as well as to nested ones, where encompassing tests are often
found to be similar to classical tests.18 Gourieroux, Monfort and Trognon (1983)
detail the properties of encompassing Wald and score tests. While the encompassing
principle does not explicitly restrict the DGP, the distribution of the test
statistics was first derived under the assumption that one of the models
is correctly specified. Gourieroux and Monfort (1995) derive robust versions
of these tests that allow for misspecification of both models.

17 Pesaran (1974) gives detailed computations for Cox's test in the case of nonnested linear
regression models.
18 For instance, the Fisher test in an embedding model is actually an encompassing-type test on
the conditional mean parameters, see Mizon and Richard (1986).

Discussion   As stressed in Section 2, the non-discrimination tests treat the models in an asymmetric way. In the artificial nesting framework or in Cox's test, the null hypothesis of interest is the correct specification of the model under consideration and the test uses the evidence provided by a nonnested alternative.19 In the encompassing view, the null hypothesis is that the model under consideration can explain the results given by a nonnested alternative. In practice, because of the asymmetry introduced between the two models, using any of these tests to discriminate between models should rely on a pair of tests. It is possible that the procedure does not come to a clear-cut decision if the tests reject both or neither of the null hypotheses. In contrast, in the discrimination approach, the null hypothesis is that both models approximate the unknown DGP equally well. The two models have a symmetric role, and the outcome of the selection test indicates whether one of the competing models dominates the other, and which one if any.
The divergence between discrimination and specification tests can be simply illustrated using Atkinson's compound model. In specification analysis, one tests successively that the mixing parameter λ is equal to zero or one. In contrast, the discrimination procedure would be based on testing that the value of the mixing parameter λ is equal to a mid-point corresponding to similar degrees of fit of the models. Atkinson (1970) first suggested the latter approach.20 He proposes the value λ = 1/2 for the mid-point, though this particular value is strongly criticized for its arbitrariness in the discussion of Atkinson's paper. Indeed, the value of the mid-point should depend on the competing models. On this ground, it is possible to build procedures that test whether the separate models provide similar degrees of fit to the data, see Vuong (1989b) for some related results.

19 "That is, the two hypotheses Hf and Hg are considered unsymmetrically, the hypothesis Hg serving only to indicate the type of alternative for which high power is required." Cox (1962, p. 407).
20 In this article, Atkinson formally distinguishes the specification problem from the discrimination problem and separately investigates each of them.
A discrimination test   The discrimination approach does not require the use of a comprehensive model. The null hypothesis of interest is simply that both models are equally good in approximating the unknown DGP, against the alternative hypotheses that one model dominates the other according to the chosen discrepancy measure. Building on our framework, we consider

H0 : E[log g1(y|x, θ1*)] = E[log g2(y|x, θ2*)]

against the alternative hypotheses

H1 : E[log g1(y|x, θ1*)] > E[log g2(y|x, θ2*)] and H2 : E[log g1(y|x, θ1*)] < E[log g2(y|x, θ2*)].
Here only three outcomes are possible. Moreover, a consistent test ensures that if one model dominates the other, it will be chosen with asymptotic probability one. This is in particular the case when the DGP is included in one of the models. Thus, though models are generally assumed misspecified at the outset, the procedure will actually select the correct model if one of the models under consideration is correctly specified.
As the empirical likelihood ratio is a consistent estimator of the difference in discrepancies, it is a natural statistic to test H0 against H1 and H2. Vuong (1989a) shows that the behavior of this statistic depends on the pseudo-true values θi* = arg max E[log gi(y|x, θi)], i = 1, 2. Specifically, under general regularity conditions, we have:

(i) If g1(y|x, θ1*) ≠ g2(y|x, θ2*), n^(−1/2) LRn(θ̂1, θ̂2) converges in distribution under H0 to a normal with mean zero and variance ω² = Var[log(g1(y|x, θ1*)/g2(y|x, θ2*))].

(ii) If g1(y|x, θ1*) = g2(y|x, θ2*), 2 LRn(θ̂1, θ̂2) converges in distribution to a weighted sum of independent chi-squares.

Therefore, the rate of convergence of the LR statistic under H0 differs according to whether one of the best approximating distributions is included in the competing approximating model. When the two approximating models are nested, G1 ⊂ G2 and the null hypothesis is equivalent to g1(y|x, θ1*) = g2(y|x, θ2*). We will then use the asymptotic distribution of the likelihood ratio in (ii) to build a test of H0, as seen in Section 4.2.1.
In contrast, in the case of nonnested models, we do not know whether (i) or (ii) is to be used. It may well be that for two nonnested models (in Cox's sense) the likelihood ratio has a √n-degenerate asymptotic distribution, as does the Cox statistic for nested or "orthogonal" hypotheses, see Gourieroux and Monfort (1994). This occurs if the best approximating distribution g1(y|x, θ1*) from G1 is an element of the competing approximating family G2 (or conversely if g2(y|x, θ2*) ∈ G1). In this situation, a discrimination test between Model 1 and Model 2 is equivalent to a test of the intersection of the two models against Model 2 (respectively Model 1), i.e. a test of nested models. So in general it is not possible to say a priori which limiting distribution should be used.21
However, g1(y|x, θ1*) = g2(y|x, θ2*) is equivalent to ω² = 0 under H0, so that a two-step procedure can be entertained, as proposed by Vuong. First test the hypothesis ω² = 0. If it is not rejected, then stop the procedure, as g1(y|x, θ1*) = g2(y|x, θ2*) implies the equality of discrepancies. If the nullity of ω² is rejected, then one can use the normal distribution in (i) to build a directional testing procedure.
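As a sketch of this two-step logic (an illustration, not Vuong's exact implementation: the first-step variance test is replaced here by a crude numerical threshold, and all names are assumptions), the directional test can be run on per-observation log-likelihoods:

```python
import math

def vuong_two_step(ll1, ll2, crit=1.96):
    """Sketch of Vuong's two-step selection procedure.

    ll1, ll2: per-observation log-likelihoods of the two fitted models.
    Step 1 tests omega^2 = 0; the proper weighted-chi-square variance
    test is replaced by a crude threshold, so this is only schematic.
    """
    n = len(ll1)
    d = [a - b for a, b in zip(ll1, ll2)]   # pointwise log-likelihood ratio
    lr = sum(d)                             # LR_n(theta1_hat, theta2_hat)
    mean_d = lr / n
    omega2 = sum((x - mean_d) ** 2 for x in d) / n
    if omega2 < 1e-12:
        # step 1: omega^2 = 0 not rejected -- equal discrepancies, stop
        return "indistinguishable"
    z = lr / math.sqrt(n * omega2)          # step 2: directional normal test
    if z > crit:
        return "model 1"
    if z < -crit:
        return "model 2"
    return "equivalent fit"
```

The degenerate branch mirrors the case where the two best approximating distributions coincide and the normal limit in (i) is unavailable.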

5 Selection of regressors in nonparametric models


Consider a setting where the interest focuses on the link between a univariate variable Y and a function g(·) of a multivariate variable X, where g(·) is not parameterized and is only restricted to belong to a set G of smooth square-integrable functions. The Gauss discrepancy between Y and g(X) is defined by

E[Y − g(X)]²,

where the expectation is evaluated with respect to the joint distribution of (Y, X). We could similarly consider E{[Y − g(X)]² w(X)}, where w(·) is an indicator function which serves to restrict attention to some intervals if necessary. Taking expectations with respect to the pair (Y, X) is at odds with the current practice in parametric situations, where expectations are considered conditionally on X-values. However, dealing with random regressors seems more relevant in econometrics, where data seldom come from controlled experiments. Beside their flexibility, most nonparametric estimation methods have the additional advantage of taking this peculiarity into account.22

21 Here necessary and sufficient conditions under which the asymptotic √n-variance of the test statistic is null are precisely determined. In Cox's test, the non-nullity of the √n-asymptotic variance is a maintained hypothesis, see White (1982).
The discrepancy due to approximation is

min_{g ∈ G} E[Y − g(X)]² = E[Y − r(X)]²,

where r(·) = E[Y|X = ·] is the regression function of Y on X. The above property shows that regression models are naturally related to the Gauss discrepancy. In this framework, discriminating among different sets of regressors with respect to the discrepancy due to approximation therefore amounts to comparing their residual variances E[Y − r(X)]². This has nothing to do with the residual being normally distributed or homoscedastic conditionally on X. It simply comes from the definition of the discrepancy and the properties of the regression function.
Discrimination with respect to the overall expected discrepancy amounts to comparing

E[Y − rn(X)]² = E[Y − r(X)]² + E[rn(X) − r(X)]²

across competing regressions, where rn(·) is some consistent estimator of r(·).23 This penalizes complexity, as E[rn(X) − r(X)]² generally depends on the dimension of X. The asymptotic equivalence of the two discrepancy measures is ensured if the last term converges to zero, i.e. rn converges in L2 to the regression function. But this is exactly the consistency concept used in nonparametric estimation, see e.g. Stone (1977). More surprising is that this consistency fails to hold even for the least-squares estimator in the X-random case (ibid.).
Most cited work uses kernel estimators. These estimators depend on the choice of a kernel function K(·) and of an asymptotically vanishing smoothing parameter hn, called the bandwidth. Kernel estimators of regression functions typically have a rate of convergence lower than parametric ones, equal to √(nhn^p) with nhn^p → ∞, where p is the dimension of X. We refer to Bierens (1987) or Härdle (1990) for a detailed presentation and properties of kernel estimators.
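For concreteness (an illustration, not taken from the article; the function name and univariate setting are assumptions), a minimal Nadaraya-Watson kernel regression estimate with a Gaussian kernel:

```python
import math

def nw_estimate(x0, xs, ys, h):
    """Nadaraya-Watson kernel regression estimate r_n(x0): a weighted
    average of the Y_i, with Gaussian-kernel weights and bandwidth h."""
    weights = [math.exp(-0.5 * ((x - x0) / h) ** 2) for x in xs]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * y for w, y in zip(weights, ys)) / total
```

A smaller h tracks the data more closely at the cost of higher variance; the rate statement above reflects exactly this bias-variance trade-off.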

5.1 Nonparametric criteria

Criteria and goodness-of-fit measures   Let us denote by rn,i(·) the regression estimator in which the i-th observation is left out. A natural cross-validation criterion

22 However, some nonparametric estimation methods have been explicitly tailored to deal with fixed regressors.
23 Stone (1977) notes this equality for nearest-neighbor estimators. It is easily seen that it is generally true.
for nonparametric models is given by

CV(hn) = (1/n) Σi [Yi − rn,i(Xi)]² w(Xi),

where w(Xi) is a known weight function with compact support (included in the support of the regressors' density). Cheng and Tong (1993) study the first-order asymptotic expansion of CV as an estimator of the residual variance in a general time-series context. Based on CV, one can build goodness-of-fit measures that estimate the fraction of the total variance of Y explained by the nonparametric regression on X. Doksum and Samarov (1995) propose

where w̄ = n⁻¹ Σi w(Xi), Ȳ = n⁻¹ Σi Yi w(Xi) and S²Y = n⁻¹ Σi (Yi − Ȳ)² w(Xi).



They study its first-order expansion as well as variants of it. They also propose estimates of the measure of relative importance of a subset Xp of X, i.e.

where rp(·) = E[Y|Xp = ·].


Alternatively, Lavergne and Vuong (1996a) consider an integral-type estimator of the residual variance. They study the statistic

where vn(·) and fn(·) are respectively nonparametric estimators of the conditional variance of Y given X and of the density of X. They show that Tn is consistent in L1 for the residual variance under general conditions allowing for dependent observations and for several nonparametric methods. They subsequently derive a nonparametric finite-sample analysis-of-variance decomposition, which leads to consider

as a goodness-of-fit measure with range from zero to one. Similarly, one can derive consistent estimates of γp based on integral estimators.
In the context of autoregression, Auestad and Tjostheim (1994) propose cross-validation estimates of the expected overall discrepancy E[Y − rn(X)]² and suggest a procedure to select the number of significant lags.
Asymptotic optimality   Let us define ĥn = arg min_{hn} CV(hn). It is known that the solution ĥn is asymptotically optimal for the distance E{w(X)[rn(X) − r(X)]²}, see Hall (1984), Härdle and Marron (1985). A natural procedure is therefore to compare the values of the CV(ĥn) criterion among competing regression models. Properties of this procedure have been studied in the i.i.d. continuous case by Zhang (1991) and Vieu (1994). The latter proves that the procedure asymptotically selects an optimal subset and deals with the general case where there may exist more than one optimal subset. In this work, an optimal subset Xp of the initial one X, with associated estimated regression rp,n(·), is defined as a subset that minimizes the distance

d(X, Xp) = ∫ [r(x) − rp,n(xp)]² w(x) f(x) dx

and that has minimal dimension.
Other work gives similar results in more general settings, but does not allow for automatic choice of the bandwidth. See Yao and Tong (1994) for dependent observations, and Cheng and Tong (1992) and Vieu (1996) for autoregression order choice.
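This selection rule can be sketched end to end (an illustration, not from the cited papers; the Gaussian kernel, the toy data and all function names are assumptions): pick ĥn by leave-one-out cross-validation, then compare CV(ĥn) across regressor sets:

```python
import math

def nw(x0, xs, ys, h, skip=None):
    # Nadaraya-Watson estimate at x0, optionally leaving out observation `skip`
    num = den = 0.0
    for j, (x, y) in enumerate(zip(xs, ys)):
        if j == skip:
            continue
        w = math.exp(-0.5 * ((x - x0) / h) ** 2)
        num += w * y
        den += w
    return num / den if den > 0.0 else 0.0

def cv(xs, ys, h):
    # leave-one-out criterion CV(h), with unit weight function
    n = len(xs)
    return sum((ys[i] - nw(xs[i], xs, ys, h, skip=i)) ** 2
               for i in range(n)) / n

def best_bandwidth(xs, ys, grid):
    # \hat{h}_n = argmin over a grid of CV(h)
    return min(grid, key=lambda h: cv(xs, ys, h))
```

On toy data, a regressor that actually drives Y attains a much smaller CV value than an unrelated one, which is what comparing CV(ĥn) across competing regressor sets exploits.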

5.2 Nonparametric hypothesis testing procedures


Within the discrimination framework, the hypotheses of interest are

H0 : E[Y − r1(X1)]² = E[Y − r2(X2)]²,
H1 : E[Y − r1(X1)]² < E[Y − r2(X2)]²,
H2 : E[Y − r1(X1)]² > E[Y − r2(X2)]².

Nonrejection of the null hypothesis H0 means that the models cannot be discriminated according to the overall expected Gauss discrepancy. Rejection of H0 in favor of either H1 or H2 indicates which model has the smaller residual variance.

5.2.1 The nested case

Consider now two nested sets, with X2 = (X1, Z2). Then clearly E[Y − r2(X2)]² ≤ E[Y − r1(X1)]², so that the only possible alternative to H0 is H2. When H0 is not rejected, one typically retains the smallest model, i.e. one invokes parsimony so as to discriminate between the two competing models. Testing H0 could be entertained by using differences in means of squared residuals. But this difference has a √n-degenerate limit distribution under H0 in the nested case. Because of the degeneracy problem, other null hypotheses have been considered. Some researchers consider

H̃0 : r1(X1) = r2(X2) almost surely.

In the nested case, this hypothesis is intuitive and meaningful. To see that H̃0 is an equivalent formulation of the null hypothesis H0, note that

E[Y − r1(X1)]² − E[Y − r2(X2)]² = E[r2(X2) − r1(X1)]² + 2 E[(Y − r2(X2))(r2(X2) − r1(X1))].

When X2 = (X1, Z2), the second term is zero. Hence H0 is equivalent to E[r2(X2) − r1(X1)]² = 0, that is, to H̃0.
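This equivalence can be checked by a quick simulation (illustrative, not from the article; the DGP and variable names are assumptions): with nested regressors, the gap between the two models' expected squared residuals equals E[r2(X2) − r1(X1)]²:

```python
import random

# Illustrative DGP: Y = X1 + Z2 + eps with X1, Z2, eps independent standard
# normals, so r1(X1) = X1 and r2(X1, Z2) = X1 + Z2.
random.seed(0)
n = 100_000
gap = diff_sq = 0.0
for _ in range(n):
    x1 = random.gauss(0, 1)
    z2 = random.gauss(0, 1)
    y = x1 + z2 + random.gauss(0, 1)
    r1, r2 = x1, x1 + z2
    gap += (y - r1) ** 2 - (y - r2) ** 2   # difference of squared residuals
    diff_sq += (r2 - r1) ** 2              # E[r2 - r1]^2 term
gap /= n
diff_sq /= n
# both averages estimate E[r2(X2) - r1(X1)]^2 = Var(Z2) = 1, since the
# cross term 2 E[(Y - r2)(r2 - r1)] vanishes when X1 is a subset of X2
```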

Robinson (1989) proposes finite conditional moment (CM) tests, as developed by Newey (1985) and Tauchen (1985), in semiparametric and nonparametric models (on this topic, see also White and Hong, 1993). He suggests that his tests can serve for testing the relevance of some regressors. The resulting tests, however, are designed to evaluate the condition E[(Y − r1(X1)) Q(X2)] = 0, with known Q(·). This is implied by, but not equivalent to, H̃0. Gozalo's (1993) test is a particular CM test based on the squared differences between the two estimated regression functions at some fixed points (x1i, x2i). The testing procedure is made consistent against all alternatives by randomly searching over values (x1i, x2i) while letting their number increase at an appropriate rate. The main weakness of random search algorithms is that, for finite and even large samples, the distribution of the statistic may be far from its theoretical limit, leading to overrejections, see the Monte Carlo experiments in Gozalo (1993). Lewbel (1995) proposes to take as a test statistic the supremum of a growing collection of CM test statistics to obtain consistency against all alternatives. The asymptotic null distribution is then approximated through a bootstrap procedure. Lewbel does not explicitly consider selection of regressors in his paper but (as pointed out by a referee) his procedure seems applicable to this problem. Alternatively, Samarov (1991) suggests to test

which is equivalent to H̃0. He proposes an estimator of this expectation, but also encounters the degeneracy problem under the null hypothesis. He does not study this problem further and thus fails to propose a testing procedure for omission of variables.
Different devices may be used to bypass the degeneracy problem. Yatchew (1992) splits the sample and estimates each model on a subsample. The difference is shown to have a normal nondegenerate limiting distribution. A disadvantage of this technique is that one loses sample information. Hidalgo (1992) weights one of the (squared residuals) means by random normal numbers. In doing so, the probability space is extended to avoid degeneracy. Moreover, as the variance of the random weights has no effect on the resulting test, it can be chosen arbitrarily small, leading to small discriminatory power.
Recent work directly deals with the degeneracy problem of usual statistics and exploits it to propose a statistic with a rate of convergence higher than √n. It considers a moment condition of the form

E{[r2(X2) − r1(X1)]² w(X2)} = 0,
where w(·) is a positive weight function. This null hypothesis is indeed equivalent to H̃0 as soon as the support of the regressors is included in that of w(·). Estimation of the latter expectation through kernel estimators has been studied by several authors. All these estimators share a common rate of convergence, namely nh2n^(p2/2), where p2 = dim X2 and h2n is the bandwidth used in the estimation of r2(X2). Aït-Sahalia, Bickel and Stoker (1994) and Gozalo (1995) independently study empirical analogs of the type

(1/n) Σi [r2n(X2i) − r1n(X1i)]² w(X2i),
where rjn(·) is the kernel estimator of rj(·) with bandwidth hjn. They encounter a situation already mentioned in Samarov (1991), namely that the squared bias of the estimator dominates its variance, so that a test requires a bias correction. Fan and Li (1996) and Lavergne and Vuong (1995) note that E[(r2 − r1)² w(X2)] = E[(Y − r1) E(Y − r1|X2) w(X2)]. When w(X2) is the density of X2, Fan and Li (1996) propose the sum-type estimator

(n(n−1))⁻¹ Σi Σj≠i [Yi − r1n(X1i)][Yj − r1n(X1j)] Kn(X2i − X2j),

where Kn(·) = h2n^(−p2) K(·/h2n). The resulting test statistic does not require any bias correction, has an nh2n^(p2/2) rate of convergence and is asymptotically normal. A properly rescaled version of this statistic provides a consistent one-sided normal test of H0.24

24 Li (1997) extends the procedure to time series models.



However, Fan and Li's procedure generally imposes oversmoothing of the null regression model relative to the complete one, more than implied by the dimensionality of the regressors' sets. Lavergne and Vuong (1995) propose a modification of this test statistic, with a similar behavior, that does not have this drawback. They also show that the test has non-trivial power against local alternatives of the type

r2(X2) = r1(X1) + δn l(X2), δn → 0.

For testing the significance of discrete regressors, Lavergne (1997) proposes an analogous testing procedure. In this case, a similar amount of smoothing is applied to the null regression model and to the complete one.
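As an illustration of the sum-type idea (not the authors' exact statistic; the Gaussian kernel, univariate X2 and all names are assumptions here), the statistic pairs null-model residuals through a kernel in X2:

```python
import math

def kernel(u):
    # standard Gaussian kernel
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def sum_type_stat(resid, x2, h):
    """(n(n-1))^{-1} sum over i != j of e_i e_j K_h(X2_i - X2_j),
    with e_i the residuals from the null (smaller) regression model."""
    n = len(resid)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += resid[i] * resid[j] * kernel((x2[i] - x2[j]) / h) / h
    return total / (n * (n - 1))
```

Residuals unrelated to X2 make positive and negative products cancel on average, while residuals that vary systematically with X2 give nearby (heavily weighted) pairs products of the same sign, pushing the statistic upward.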

5.2.2 The nonnested case

A discrimination test   We say that two sets of regressors X1 and X2 are nonnested if neither of them is nested in the other. For testing H0, Lavergne and Vuong (1996b) consider the criteria

MSEn = (1/n) Σi [Yi − rn,i(Xi)]² I[fn(Xi) > bn],

where I[fn(Xi) > bn] is a stochastic trimming function, depending on an asymptotically vanishing bn, and fn(·) is the kernel estimator of the density of X. They show that the difference between the criteria MSEin, i = 1, 2, has an asymptotic √n-normal distribution. Moreover, they show that under H0, this limiting distribution is degenerate if and only if

(D)   E[r2(X2)|X1] = r1(X1) or E[r1(X1)|X2] = r2(X2).
We are then back to a situation very similar to the one already encountered in likelihood contexts. When the two sets of regressors are nested, i.e. X2 = (X1, Z2), the √n-asymptotic distribution of MSE2n − MSE1n cannot be used to build a test, because (D) automatically holds, as E[E(Y|X2)|X1] = E(Y|X1). One should then consider a higher rate of convergence than √n. In the case of nonnested competing sets of regressors, i.e. X1 = (W, Z1) and X2 = (W, Z2), (D) does not always hold. When (D) does not hold, Lavergne and Vuong provide a testing procedure for H0. But it may be that e.g. E(Y|X1) = E(Y|W), so that the comparison of the nonnested sets X1 and X2 is equivalent to the comparison of the nested sets W and X2. Testing such an occurrence then requires a test for nested models.
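The trimmed criterion itself is straightforward to sketch (an illustration; the Gaussian kernel density estimate, the fitted values and the threshold b are placeholder assumptions):

```python
import math

def kde(x0, xs, h):
    # kernel density estimate f_n(x0) with a Gaussian kernel
    n = len(xs)
    return sum(math.exp(-0.5 * ((x0 - x) / h) ** 2)
               for x in xs) / (n * h * math.sqrt(2 * math.pi))

def trimmed_mse(ys, fitted, xs, h, b):
    # MSE_n = (1/n) sum_i (Y_i - r_{n,i}(X_i))^2 * I[f_n(X_i) > b]
    n = len(ys)
    return sum((y - f) ** 2 * (1.0 if kde(x, xs, h) > b else 0.0)
               for y, f, x in zip(ys, fitted, xs)) / n
```

Comparing trimmed_mse for two competing regressor sets mimics MSE1n − MSE2n; the trimming discards observations in low-density regions, where the kernel regression estimate is unreliable.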
Non-discrimination tests   Before concluding, it is interesting to examine how other nonnested hypothesis testing procedures, namely those based on artificial nesting and encompassing, have been or could be adapted to the nonparametric context. Consider first the artificial nesting approach. A natural comprehensive model is given by a regression model with the union of X1 and X2 as the regressors set. Then one can apply two nested tests so as to compare the discrepancies between the comprehensive and the first model, and between the comprehensive and the second model. This is equivalent to testing the equality of regression functions

H̄1 : E(Y|X1, X2) = E(Y|X1)

and

H̄2 : E(Y|X1, X2) = E(Y|X2).

Alternatively, Delgado, Li and Stengos (1995) consider the mixture regression model

E(Y|X1, X2) = (1 − λ) r1(X1) + λ r2(X2)

as an extension of Davidson and MacKinnon's nesting model and develop a specification testing procedure of H̄1 against H̄2.

The encompassing principle can also be generalized to nonparametric regression models. It suffices to note that pseudo-true values are generally defined as minimizers of the chosen discrepancy between the two considered models. In our framework, the pseudo-true value of the regression on X1 given Model 2 is

arg min_{g1} E[r2(X2) − g1(X1)]²,

see Gourieroux and Monfort (1994) for a similar definition in semiparametric regression models. Clearly the solution of this problem is E[E(Y|X2)|X1]. Model 2 will then encompass Model 1 if

r1(X1) = E[E(Y|X2)|X1],

that is, if the regression function of Model 1 can be recovered by the regression function in Model 2. In particular, if the regressors set X1 is nested in X2, Model 2 automatically encompasses Model 1. Moreover, H̄2 implies He, so that the first approach appears as a particular case of encompassing. Bontemps and Florens (1996) propose a procedure for this hypothesis in the case where X1 and X2 do not overlap.
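A small discrete example (with hypothetical numbers) illustrates the encompassing condition: the iterated projection E[E(Y|Z2)|Z1] need not recover r1(Z1) when the regressor sets are nonnested, so Model 2 need not encompass Model 1:

```python
# Hypothetical joint pmf of two correlated binary regressors (Z1, Z2);
# take Y = Z1, so r1(z1) = z1 and r2(z2) = E[Z1 | Z2 = z2].
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def r2(z2):
    # regression of Y on Z2: E[Z1 | Z2 = z2]
    num = sum(prob * z1 for (z1, b), prob in p.items() if b == z2)
    den = sum(prob for (_, b), prob in p.items() if b == z2)
    return num / den

def proj_on_z1(z1):
    # iterated projection E[r2(Z2) | Z1 = z1]
    num = sum(prob * r2(b) for (a, b), prob in p.items() if a == z1)
    den = sum(prob for (a, _), prob in p.items() if a == z1)
    return num / den

# Model 2 encompasses Model 1 only if proj_on_z1(z1) equals r1(z1) = z1;
# here proj_on_z1(0) is about 0.32, not 0, so encompassing fails.
```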
The differences between discrimination and non-discrimination tests that we stressed in a parametric context remain true. In all three situations above, choice between X1 and X2 would require two tests, which would lead to four different outcomes. Moreover, it can happen that neither H̄1 nor H̄2 holds, or that neither of the models encompasses the competing one, so that such a procedure would not allow one to discriminate between the competing models. It is also noteworthy that either H̄1 or H̄2, and He or its counterpart, each implies (D). Therefore, Lavergne and Vuong's (1996b) testing procedure applies in situations where neither the artificial nesting approach nor the encompassing approach could be applied.

6 Conclusion

In the previous sections, we have used the discrimination framework to unify selection of regressors in parametric or nonparametric regression models, and to derive model selection criteria as well as discrimination testing procedures. This framework fulfills at least two major requirements for a model selection approach in econometrics, as recommended by Nakamura, Nakamura and Duleep (1990). First, one of the "maintained hypotheses" of the approach is that the models among which one wants to discriminate are all approximating models and may not contain the true data generating process. This is in line with the modern econometric practice which considers that virtually all economic models are approximations and misspecified to some greater or lesser degree. Model selection criteria and tests can then be derived without any parametric assumption on the DGP. The approach thus allows selection not only among misspecified models, but also among correct models, in the sense that several models can be correct at the same time, see the case of linear models with multivariate normal variables. Selection among nonparametric regression models also comes down to selection among correct models, since each represents the conditional expectation of the dependent variable given a particular set of explanatory variables.
Second, the approach allows explicit reference to the modelling goals in the discrimination strategy, through the choice of the discrepancy measure. For instance, the interest may focus on the predictive ability of the model, see e.g. West (1996), or on the mean squared error of estimation for parameters. The approach can in some cases serve for the choice of the estimation method, but it is not akin to an estimation framework. Recent work has considered the case where the discrepancy used for discrimination is different from the one used in estimation, see e.g. Rivers and Vuong (1990), Sin and White (1996), so that estimation and selection are explicitly treated as separate problems.25
The discrimination approach has long been brought into conflict with specification analysis. We hope that this review article has made clear that they are not rivals. Specification tests aim to check whether a model is inconsistent with some specific aspects of the data. However, they do not provide any basis for judging the costs, in terms of goodness of the approximation, of the specification problems that are detected. Discrimination procedures aim to select the best (or the least bad) of the models in terms of the chosen measure, but do not provide any indication of possible misspecifications of the preferred model. Confrontation between the two approaches sheds light on a particular feature of discrimination tests. While in a parametric context nonnested models are defined through the approximating families in specification analysis, see Pesaran (1987), they are defined through the best approximating families in the discrimination approach, and therefore one should test whether models are nested or not, see Vuong (1989a). In a nonparametric context, we get a similar situation.
The general discrimination framework should be particularly useful for model selection in econometrics, on both theoretical and practical grounds. From a theoretical viewpoint, it allows us to unify the various procedures and may help in proposing new ones. The overview given in this paper can suggest some directions for research in this domain. In particular, some recent work focuses on the application of resampling techniques, such as cross-validation or bootstrapping, to model selection criteria (see Shao, 1993 and 1996). The use of these methods for the implementation of discrimination testing procedures still largely needs to be studied. The literature on selection of regressors in nonparametric models is also quite recent. Much remains to be done from both a theoretical and a data-analytic viewpoint, as detailed in Auestad and Tjostheim (1994). For the most part of our review, we have deliberately dealt with i.i.d. contexts. Considerable work has been done in the field of time series, but model selection under nonstationarity has been relatively less studied, see e.g. Potscher (1989) and the references therein. It is also worth mentioning that multiple comparison techniques may prove useful in a context where one usually has to select among a collection of potential models, on this see Gupta and Panchapakesan (1979). Maybe

25 See Leamer (1983) for a critical viewpoint on confusion between model selection and estimation problems.
the most crucial issue is the scarcity of general results on model selection by a discrepancy; two recent references on this issue are Rivers and Vuong (1990) and Sin and White (1996). Clearly, this list cannot be exhaustive, but it shows that important issues still need to be addressed.
From a practical viewpoint, implementation can vary with the particular modelling situation, as the econometrician can define the particular strategy he will use in view of his goal, of the economic problem and of the data at hand. Different practitioners with different objectives will end up with different indices of goodness-of-fit corresponding to different aspects of the model. If necessary, it is also possible to consider various strategies so as to examine the strengths and weaknesses of alternative models in different dimensions. In any case, one can avoid a mechanical application of statistical procedures that would be unrelated to the intended uses of the model.
REFERENCES

AIT-SAHALIA, Y., P. BICKEL and T.M. STOKER (1994): "Goodness-of-fit tests for regression using kernel methods," University of Chicago.
AKAIKE, H. (1973): "Information theory and an extension of the maximum likelihood principle," in Proceedings of the Second International Symposium on Information Theory, B.N. Petrov and F. Csaki eds. Akademiai Kiado: Budapest, pp. 267-281.
ALLEN, D.M. (1974): "The relationship between variable selection and data augmentation and a method for prediction," Technometrics, 16(1), pp. 125-126.
AMEMIYA, T. (1980): "Selection of regressors," International Economic Review, 21(2), pp. 331-354.
ANDREWS, D.W.K. (1991): "Asymptotic optimality of generalized CL, cross-validation and generalized cross-validation in regression with heteroscedastic errors," Journal of Econometrics, 47, pp. 359-377.
ATKINSON, A.C. (1970): "A method for discriminating between models," (with discussion) Journal of the Royal Statistical Society, Series B, 32, pp. 323-353.
AUESTAD, B. and D. TJOSTHEIM (1994): "Nonparametric identification of nonlinear time series: selecting significant lags," Journal of the American Statistical Association, 89(428), pp. 1410-1419.
BIERENS, H.J. (1987): "Kernel estimators of regression functions," in Advances in Econometrics, ed. by T. Bewley. Cambridge: Cambridge University Press, pp. 99-144.
BONTEMPS, C. and J.P. FLORENS (1996): "A global encompassing criterion for nonparametric regression models," Université des Sciences Sociales, Toulouse.
BREIMAN, L. (1992): "The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction error," Journal of the American Statistical Association, 87(419), pp. 738-754.
BREIMAN, L. (1996): "Heuristics of instability and stabilization in model selection," Annals of Statistics, 24(6), pp. 2350-2383.
BREIMAN, L. and D. FREEDMAN (1983): "How many variables should be entered in a regression equation?," Journal of the American Statistical Association, 78(381), pp. 131-136.
BREIMAN, L. and P. SPECTOR (1992): "Submodel selection and evaluation in regression. The X-random case," International Statistical Review, 60(3), pp. 291-319.
BURMAN, P. (1989): "A comparative study of ordinary cross-validation, v-fold cross-validation, and the repeated learning-testing methods," Biometrika, 76(3), pp. 503-514.
CHENG, B. and H. TONG (1992): "On consistent nonparametric order determination and chaos," Journal of the Royal Statistical Society, Series B, 54(2), pp. 427-474.
CHENG, B. and H. TONG (1993): "On residual sums of squares in non-parametric autoregression," Stochastic Processes and their Applications, 48, pp. 154-174.
CHOW, G.C. (1980): "The selection of variates for use in prediction: a generalization of Hotelling's solution," in Quantitative Econometrics and Development, L.R. Klein, M. Nerlove and S.C. Tsiang eds. New York: Academic Press, pp. 105-114.
CHOW, G.C. (1981): "A comparison of the information and posterior probability criteria for model selection," Journal of Econometrics, 16(1), pp. 21-33.
COX, D.R. (1961): "Tests of separate families of hypotheses," in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press, vol. 1, pp. 105-123.
COX, D.R. (1962): "Further results on tests of separate families of hypotheses," Journal of the Royal Statistical Society, Series B, 24, pp. 406-424.
CRAVEN, P. and G. WAHBA (1979): "Smoothing noisy data with spline functions," Numerische Mathematik, 31, pp. 377-403.
DAVIDSON, R. and J.G. MACKINNON (1981): "Several tests for model specification in the presence of alternative hypotheses," Econometrica, 49, pp. 781-793.
DELGADO, M.A., Q. LI and T. STENGOS (1995): "Nonparametric specification testing of nonnested econometric models," Universidad Carlos III, Madrid.
DOKSUM, K. and A. SAMAROV (1995): "Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression," Annals of Statistics, 23(5), pp. 1443-1473.
EBBELER, D.H. (1975) : "On the probability of correct model selection using the maximum
'?I choice criterion," International Economic Review, pp. 516-520.
EFRON, B. (1983): "Estimating the error rate of a prediction rule: improvement on cross-
validation," Journal of the American Statistical Association, 78(382), pp. 316-331.

EFRON, B. (1984): "Comparing non-nested linear models," Journal of the American Sta-
tistical Association, 79 (388), pp. 791-803.

EFROYMSON, M.A. (1960) : "Stepwise regression - a backward and forward look," pre-
sented at Eastern Regional Meetings of the Institute of Mathematical Statistics, Florham
Park, New Jersey.
EUBANK, R.L. (1988): Spline smoothing and nonpammetric regression. New York: Marcel
Dekker.
FAN, Y. and Q. LI (1996): "Consistent model specification tests: omitted variables and
semiparametric functional forms," Econometrica, 64(4), pp. 865-890.

FRIEDMAN, M. and D. MEISELMAN (1963): "The relative stability of monetary velocity
and the investment multiplier in the U.S., 1897-1958," in Stabilization Policies, Commission
on Money and Credit. Englewood Cliffs: Prentice Hall.
GEISSER, S. (1975): "The predictive sample reuse method with applications," Journal of
the American Statistical Association, 70, pp. 320-328.
GEISSER, S. and W. F. EDDY (1979): "A predictive approach to model selection," Journal
of the American Statistical Association, 74(365), pp. 153-160.
GOODNIGHT, J.H. and T.D. WALLACE (1972): "Operational techniques and tables for
making weak MSE tests for restrictions in regressions," Econometrica, 40(4), pp. 699-709.
GOURIEROUX, C., A. MONFORT and A. TROGNON (1983) : "Testing nested or non-
nested hypotheses," Journal of Econometrics, 21, pp. 83-115.

GOURIEROUX, C. and A. MONFORT (1995): "Testing, encompassing and simulating
dynamic econometric models," Econometric Theory, 11(2), pp. 195-228.
GOURIEROUX, C. and A. MONFORT (1994): "Testing non-nested hypotheses," in Hand-
book of Econometrics, Vol. 4, R.F. Engle and D.L. McFadden, eds., pp. 2583-2637.
GOZALO, P.L. (1993): "A consistent model specification test for nonparametric estimation
of regression function models," Econometric Theory, 9, pp. 451-477.
GOZALO, P.L. (1995): "Nonparametric specification testing with √n-local power and boot-
strap critical values," Brown University.
HALL, P. (1984): "Asymptotic properties of integrated square error and cross-validation
for kernel estimation of a regression function," Zeitschrift für Wahrscheinlichkeitstheorie
und verwandte Gebiete, 67, pp. 175-196.

HANNAN, E.J. and B.G. QUINN (1979): "The determination of the order of an autore-
gression," Journal of the Royal Statistical Society, Series B, 41 (2), pp. 190-195.
HARDLE, W. (1990): Applied nonparametric regression. Cambridge: Cambridge University
Press.
HARDLE, W. and J.S. MARRON (1985): "Optimal bandwidth selection in nonparametric
regression function estimation," Annals of Statistics, 13(4), pp. 1465-1481.
HENDRY, D.F. and J.-F. RICHARD (1982) : "On the formulation of empirical models in
dynamic econometrics," Journal of Econometrics, 20, pp. 3-33.
HIDALGO, J. (1992) : "A general non-parametric misspecification test," London School of
Economics.
HOCKING, R.R. (1972): "Criteria for selection of a subset regression: which one should
be used?," Technometrics, 14(4), pp. 967-970.
HOCKING, R.R. (1976) : "The analysis and selection of variables in multiple regression,"
Biometrics, 32, pp. 1-49.


HOTELLING, H. (1940) : "The selection of variates for use in prediction with some com-
ments on the general problem of nuisance parameters," Annals of Mathematical Statistics,
11, pp. 271-283.

JENRICH, R.I. (1969): "Asymptotic properties of non-linear least squares estimators,"
Annals of Mathematical Statistics, 40(2), pp. 633-643.
KINAL, T. and K. LAHIRI (1984): "A note on "Selection of regressors"," International
Economic Review, 25(3), pp. 625-629.
LAVERGNE, P. (1997): "An equality test across nonparametric regressions," INRA-ESR
Toulouse.
LAVERGNE, P. and Q. VUONG (1995) : "Nonparametric significance testing," INRA-ESR
Toulouse.
LAVERGNE, P. and Q. VUONG (1996a) : "An integral estimator of residual variance
and a measure of explanatory power of variables in nonparametric regression," INRA-ESR
Toulouse.
LAVERGNE, P. and Q.H. VUONG (1996b): "Nonparametric selection of regressors: the
nonnested case," Econometrica, 64(1), pp. 207-219.
LEAMER, E.E. (1983) : "Model choice and specification analysis," in Handbook of Econo-
metrics, ed. by Z. Griliches and M.D. Intriligator. Amsterdam: North-Holland, Vol. 1,
Chap. 5, pp. 285-330.
LEWBEL, A. (1995): "Consistent nonparametric hypothesis tests with an application to
Slutsky symmetry," Journal of Econometrics, 67(2), pp. 379-401.
270 LAVERGNE

LI, K.-C. (1987): "Asymptotic optimality for Cp, CL, cross-validation and generalized cross-
validation: discrete index set," Annals of Statistics, 15(3), pp. 958-975.
LI, Q. (1997): "Consistent model specification tests for time series econometric models,"
University of Guelph.
LIEN, D. and Q.H. VUONG (1987): "Selecting the best linear regression model, a classical
approach," Journal of Econometrics, 35, pp. 3-23.

LINHART, H. and W. ZUCCHINI (1986): Model selection. New York: Wiley & Sons.
MACKINNON, J.G. (1983) : "Model specification tests against non-nested alternatives,"
Econometric Reviews, 2(1), pp. 85-110.
MALLOWS, C.L. (1973): "Some comments on Cp," Technometrics, 15(4), pp. 661-675.

McALEER, M. and C.R. McKENZIE (1989): "Keynesian and new classical models of un-
employment revisited," Australian National University.

MILLER, A.J. (1990) : Subset selection in regression. London: Chapman and Hall.

MIZON, G.E. and J.F. RICHARD (1986): "The encompassing principle and its application
to testing non-nested hypotheses," Econometrica, 54(3), pp. 657-678.
NAKAMURA, A.O., M. NAKAMURA and H.O. DULEEP (1990): "Alternative approaches
to model choice," Journal of Economic Behavior and Organization, 14, pp. 97-125.

NEWEY, W.K. (1985): "Maximum likelihood specification testing and conditional moment
tests," Econometrica, 53(5), pp. 1047-1070.
NEWEY, W.K. and D.L. McFADDEN (1994) : "Large sample estimation and hypothesis
testing," in Handbook of Econometrics, ed. by R.F. Engle and D.L. McFadden. Amsterdam:
North-Holland, Vol. 4, Chap. 36, pp. 2111-2245.
NISHII, R. (1988): "Maximum likelihood principle and model selection when the true model
is unspecified," Journal of Multivariate Analysis, 27, pp. 392-403.
PESARAN, M.H. (1974) : "On the general problem of model selection," Review of Economic
Studies, 41, pp. 153-171.
PESARAN, M.H. (1987): "Global and partial non-nested hypotheses and asymptotic local
power," Econometric Theory, 3, pp. 69-97.
QUANDT, R.E. (1974) : "A comparison of methods for testing non-nested hypotheses,"
Review of Economics and Statistics, 56, pp. 92-99.
RIVERS, D. and Q.H. VUONG (1990): "Model selection tests for nonlinear dynamic mod-
els," University of Southern California.
ROBINSON, P.M. (1989): "Hypothesis testing in semiparametric and nonparametric models
for econometric time series," Review of Economic Studies, 56, pp. 511-534.

ROTHMAN, D. (1968): "Letter to the editor," Technometrics, 10, p. 432.


SAMAROV, A.M. (1993): "Exploring regression structure using nonparametric functional
estimation," Journal of the American Statistical Association, 88(423), pp. 836-847.
SAWA, T. (1978): "Information criteria for discriminating among alternative regression
models," Econometrica, 46(6), pp. 1273-1291.

SCHWARZ, G. (1978): "Estimating the dimension of a model," Annals of Statistics, 6(2),
pp. 461-464.

SHAO, J. (1993): "Linear model selection by cross-validation," Journal of the American
Statistical Association, 88(422), pp. 486-494.
SHAO, J. (1996): "Bootstrap model selection," Journal of the American Statistical Associ-
ation, 91(434), pp. 655-664.
SHIBATA, R. (1976): "Selection of the order of an autoregressive model by Akaike's infor-
mation criterion," Biometrika, 63(1), pp. 117-126.


SHIBATA, R. (1980): "Asymptotically efficient selection of the order of the model for esti-
mating parameters of a linear process," Annals of Statistics, 8(1), pp. 147-164.
SHIBATA, R. (1981): "An optimal selection of regression variables," Biometrika, 68(1),
pp. 45-54.
SIN, C.-Y. and H. W H I T E (1996): "Information criteria for selecting possibly misspecified
parametric models," Journal of Econometrics, 71, pp. 207-225.
STONE, C.J. (1977): "Consistent nonparametric regression" (with discussion), Annals of
Statistics, 5(4), pp. 595-645.
STONE, M. (1974): "Cross-validatory choice and assessment of statistical predictions," Jour-
nal of the Royal Statistical Society, Series B, 36, pp. 111-147.
STONE, M. (1976): "An asymptotic equivalence of choice of model by cross-validation and
Akaike's criterion," Journal of the Royal Statistical Society, Series B, 39, pp. 44-47.

STONE, M. (1979): "Comments on model selection criteria of Akaike and Schwarz," Journal
of the Royal Statistical Society, Series B, 41(2), pp. 276-278.
TAUCHEN, G. (1985): "Diagnostic testing and evaluation of maximum likelihood models,"
Journal of Econometrics, 30, pp. 415-443.
THEIL, H. (1957): "Specification errors and the estimation of economic relationships,"
Review of the International Statistical Institute, 25, pp. 41-51.
THOMPSON, M.L. (1978): "Selection of variables in multiple regression," Parts 1 and 2,
International Statistical Review, 46, pp. 1-19 and 129-146.

TORO-VIZCARRONDO, C. and T.D. WALLACE (1968): "A test of the mean square
error criterion for restrictions in linear regression," Journal of the American Statistical As-
sociation, 63, pp. 558-572.

TUKEY, J.W. (1967): "Discussion (of Anscombe, 1967)," Journal of the Royal Statistical
Society, Series B, 29, pp. 47-48.
VIEU, P. (1994) : "Choice of regressors in nonparametric estimation," Computational Statis-
tics and Data Analysis, 17, pp. 575-594.
VIEU, P. (1996) : "Order choice in nonlinear autoregressive models," Statistics, 26, pp.
307-328.

VUONG, Q.H. (1989a): "Likelihood ratio tests for model selection and non-nested hypothe-
ses," Econometrica, 57(2), pp. 307-333.

VUONG, Q.H. (1989b) : "Model selection, classical tests and the comprehensive method,"
University of Southern California.

WALLACE, T.D. (1972): "Weaker criteria and tests for linear restrictions in regression,"
Econometrica, 40(4), pp. 689-698.
WALLACE, T.D. and C. TORO-VIZCARRONDO (1969) : "Tables for the mean square
error test for exact linear restrictions in regression," Journal of the American Statistical
Association, 64, pp. 1649-1663.
WEST, K.D. (1996): "Asymptotic inference about predictive ability," Econometrica, 64(5),
pp. 1067-1084.
WHITE, H. (1982): "Maximum likelihood estimation of misspecified models," Economet-
rica, 50(1), pp. 1-25.

WHITE, H. (1982): "Regularity conditions for Cox's test of non-nested hypotheses," Journal
of Econometrics, 19(2/3), pp. 301-318.
WHITE, H. (1994): Estimation, inference and specification analysis. Cambridge: Cambridge
University Press.
WHITE, H. and Y. HONG (1993): "M-testing using finite and infinite dimensional param-
eter estimators," University of California, San Diego.
YAO, Q. and H. TONG (1994): "On subset selection in non-parametric stochastic regres-
sion," Statistica Sinica, 4, pp. 51-70.
YATCHEW, A.J. (1992): "Nonparametric regression tests based on least squares," Econo-
metric Theory, 8, pp. 435-451.
ZELLNER, A. (1971): An Introduction to Bayesian Inference in Econometrics. New York:
Wiley & Sons.
ZHANG, P. (1991): "Variable selection in nonparametric regression with continuous co-
variates," Annals of Statistics, 19(4), pp. 1869-1882.

ZHANG, P. (1993): "Model selection via multifold cross-validation," Annals of Statistics,
21(1), pp. 299-313.

APPENDIX

Derivation of MSEP

Now, $X\beta^0 - P_p X\beta^0 = M_p X\beta^0$ and $P_p X\beta^0 - X_p\hat\beta_p = P_p(X\beta^0 - y) = -P_p u$. Then, denoting
$u^0 = y^0 - X\beta^0$,
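The display that originally followed here was lost in extraction. Under the standard setup implicit in the preceding line ($y = X\beta^0 + u$ with $\mathrm{E}[u] = 0$ and $V(u) = \sigma^2 I$, $y^0$ an independent replicate at the same design points, $P_p$ the orthogonal projector onto the columns of $X_p$ and $M_p = I - P_p$), the computation can be sketched as follows; this is a reconstruction, not the article's exact display.

```latex
% Sketch of the MSEP derivation under the assumptions stated in the text.
\begin{align*}
y^0 - X_p\hat\beta_p
  &= u^0 + \bigl(X\beta^0 - P_p X\beta^0\bigr) + \bigl(P_p X\beta^0 - X_p\hat\beta_p\bigr)\\
  &= u^0 + M_p X\beta^0 - P_p u,
\end{align*}
% The three terms are mutually uncorrelated, so
\begin{align*}
\mathrm{E}\,\bigl\|y^0 - X_p\hat\beta_p\bigr\|^2
  &= n\sigma^2 + \beta^{0\prime} X' M_p X \beta^0 + \sigma^2\,\mathrm{tr}(P_p)\\
  &= (n + p)\,\sigma^2 + \beta^{0\prime} X' M_p X \beta^0 .
\end{align*}
```

This is the usual decomposition of the MSEP into irreducible noise, squared bias from the omitted regressors, and estimation variance.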

Derivation of Equations (5)


We only give a sketch of the proof. Details and assumptions can be found in Linhart and
Zucchini (1986, Appendix). From a Taylor expansion of $L_n(\hat\theta)$ around $\theta^*$, we get

Moreover, from another Taylor expansion,

Combining both, we have

$$L_n(\hat\theta) \;\simeq\; L_n(\theta^*) \;-\; \frac{1}{2}\,(\hat\theta - \theta^*)'\,\frac{\partial^2 L_n(\theta^*)}{\partial\theta\,\partial\theta'}\,(\hat\theta - \theta^*),$$
and taking expectation, we get

But $\mathrm{E}[L_n(\theta^*)] = \mathrm{E}[\log g(y|x,\theta^*)]$, so that the first part of (8) is shown.


By Lemma 3 of Jenrich (1969), we can show that

Taking expectation, we get

Combining this with the first part of (8), we get the second part of (8).
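The expectation step suppressed above can be sketched as follows. The symbols $A$ and $B$, for the expected Hessian and the variance of the score at $\theta^*$, are introduced here purely for illustration and only loosely follow the notation of Linhart and Zucchini (1986).

```latex
% Sketch, assuming L_n(\theta) = n^{-1}\sum_i \log g(y_i|x_i,\theta),
% A = E[\partial^2 \log g(y|x,\theta^*)/\partial\theta\,\partial\theta'] and
% B = V[\partial \log g(y|x,\theta^*)/\partial\theta].
\begin{align*}
\sqrt{n}\,(\hat\theta - \theta^*) &\xrightarrow{\;d\;} N\!\bigl(0,\; A^{-1} B\, A^{-1}\bigr),\\
\mathrm{E}\!\left[(\hat\theta - \theta^*)'\,
  \frac{\partial^2 L_n(\theta^*)}{\partial\theta\,\partial\theta'}\,
  (\hat\theta - \theta^*)\right]
  &\approx \frac{1}{n}\,\mathrm{tr}\!\bigl(A\, A^{-1} B\, A^{-1}\bigr)
   = \frac{1}{n}\,\mathrm{tr}\!\bigl(B A^{-1}\bigr),\\
\text{so that}\qquad
\mathrm{E}\bigl[L_n(\hat\theta)\bigr]
  &\approx \mathrm{E}\bigl[\log g(y|x,\theta^*)\bigr]
   - \frac{1}{2n}\,\mathrm{tr}\!\bigl(B A^{-1}\bigr).
\end{align*}
```

Since $A$ is negative definite, the trace term is negative, so the maximized log-likelihood overestimates the expected log-likelihood at $\theta^*$, which is the bias that the criterion in (8) corrects.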
