
Supplementary materials for this article are available online. Please go to http://tandfonline.com/r/JBES

Adaptive Elastic Net for Generalized Methods of Moments Mehmet CANER Department of Economics, North Carolina State University, 4168 Nelson Hall, Raleigh, NC 27518 ([email protected])

Hao Helen ZHANG


Department of Mathematics, University of Arizona, Tucson, AZ 85718 ([email protected] ); Department of Statistics, North Carolina State University, Raleigh, NC 27695 ([email protected]) Model selection and estimation are crucial parts of econometrics. This article introduces a new technique that can simultaneously estimate and select the model in generalized method of moments (GMM) context. The GMM is particularly powerful for analyzing complex datasets such as longitudinal and panel data, and it has wide applications in econometrics. This article extends the least squares based adaptive elastic net estimator by Zou and Zhang to nonlinear equation systems with endogenous variables. The extension is not trivial and involves a new proof technique due to estimators’ lack of closed-form solutions. Compared to Bridge-GMM by Caner, we allow for the number of parameters to diverge to infinity as well as collinearity among a large number of variables; also, the redundant parameters are set to zero via a data-dependent technique. This method has the oracle property, meaning that we can estimate nonzero parameters with their standard limit and the redundant parameters are dropped from the equations simultaneously. Numerical examples are used to illustrate the performance of the new method. KEY WORDS: GMM; Oracle property; Penalized estimators.

1. INTRODUCTION

One of the most commonly used estimation techniques is the generalized method of moments (GMM). The GMM provides a unified framework for parameter estimation, encompassing many common estimation methods such as ordinary least squares (OLS), maximum likelihood (MLE), and instrumental variables. We can estimate the parameters by the two-step efficient GMM of Hansen (1982). The GMM is an important tool in the econometrics, finance, accounting, and strategic planning literatures as well. In this article, we are concerned with model selection in GMM when the number of parameters diverges. Such situations can arise in labor economics, international finance (see Alfaro, Kalemli-Ozcan, and Volosovych 2008), and elsewhere. In linear models, when some of the regressors are correlated with the errors and there is a large number of covariates, model selection tools are essential, since they can improve the finite sample performance of the estimators.

Model selection techniques are very useful and widely used in statistics. For example, Tibshirani (1996) proposed the lasso, Knight and Fu (2000) derived the asymptotic properties of the lasso, and Fan and Li (2001) proposed the SCAD estimator. In econometrics, Knight (2008) and Caner (2009) offered Bridge-least squares and Bridge-GMM estimators, respectively. But these procedures all consider finite dimensions and do not take into account collinearity among the variables. Recently, model selection with a large number of parameters has been analyzed in least squares by Huang, Horowitz, and Ma (2008) and Zou and Zhang (2009), where the first article analyzes the Bridge estimator and the second is concerned with the adaptive elastic net estimator. The adaptive elastic net estimator has the oracle property when the number of parameters diverges with the sample size. Furthermore, this method can handle the collinearity arising from a large number of regressors when the system is linear with endogenous regressors. When some of the parameters are redundant (i.e., when the true model has a sparse representation), this estimator can estimate the zero parameters as exactly zero.

In this article, we extend the least squares based adaptive elastic net of Zou and Zhang (2009) to GMM. The following issues are pertinent to model selection in GMM: (i) handling a large number of control variables in the structural equation of a simultaneous equation system, or a large number of parameters in a nonlinear system with endogenous and control variables; (ii) taking into account correlation among variables; and (iii) achieving selection consistency and estimation efficiency simultaneously. All of these are successfully addressed in this work. In the least squares case of Zou and Zhang (2009), no explicit consistency proof is needed, since the least squares estimator has a simple closed-form solution. However, in this article, since the GMM estimator does not have a closed-form solution, an explicit consistency proof is needed before deriving the finite sample risk bounds. This is one major contribution of this article. Furthermore, to obtain the consistency proof, we substantially extend the technique used in the consistency proof of the Bridge least squares estimator by Huang, Horowitz, and Ma (2008) to the GMM with adaptive elastic net penalty. To derive the finite sample risk bounds, we use the mean value theorem and benefit from the consistency proof, unlike the least squares case of Zou and Zhang (2009). The nonlinear nature of the functions introduces additional




difficulties. The GMM case involves partial derivatives of the sample moments that depend on parameter estimates. This is unlike the least squares case where the same quantity does not depend on parameter estimates. This results in the need for consistency proof as mentioned above. Also, we extend the study by Zou and Zhang (2009) to conditionally heteroscedastic data, and this results in tuning parameter for l1 norm to be larger than the one in least squares case. We also pinpoint ways to handle stationary time series cases. The estimator also has the oracle property, and the nonzero coefficients are estimated converging to a normal distribution. This is their standard limit and furthermore, the zero parameters are estimated as zero. Note that the oracle property is a pointwise criterion. Earlier works on diverging parameters include Portnoy (1984), Huber (1988), and He and Shao (2000). In recent years, there are a few works on penalized methods for standard linear regression with diverging parameters. Fan and Peng (2004) studied the nonconcave penalized likelihood with a growing number of nuisance parameters; Lam and Fan (2008) analyzed the profile likelihood ratio inference with a growing number of parameters; and Huang, Horowitz, and Ma (2008) studied asymptotic properties of bridge estimators in sparse linear regression models. As far as we know, this is the first article to estimate and select the model in GMM with a diverging number of parameters. In econometrics, sieve estimation will be a natural application of shrinkage estimators. There are several articles that use sieves (e.g., Ai and Chen 2003; Newey and Powell 2003; Chen 2007; Chen and Ludvigson 2009). In these articles, we see that sieve dimension is determined by trying several possibilities or left for future work. Adaptive elastic net GMM can simultaneously determine the sieve dimension and estimate the structural parameters. We also see in unpenalized GMM with many parameters case, there is an article by Han and Phillips (2006). Liao (2011) considered adaptive lasso with fixed number of invalid moment conditions. Section 2 presents the model and the new estimator. Then in Section 3, we derive the asymptotic results for the proposed estimator. Section 4 conducts simulations. Section 5 provides an asset pricing example used by Chen and Ludvigson (2009). Section 6 concludes. The Appendix includes all the proofs.

2. MODEL

Let β be a p-dimensional parameter vector, where β ∈ B_p, a compact subset of R^p. The true value of β is β_0. We allow p to grow with the sample size n, so that p → ∞ as n → ∞, but p/n → 0. To avoid burdening the notation we do not attach a subscript n to the parameter space. The population orthogonality conditions are

\[ E[g(X_i, \beta_0)] = 0, \]

where the data are {X_i : i = 1, 2, ..., n}, g(·) is a known function, and the number of orthogonality restrictions is q, with q ≥ p. We also allow q to grow with the sample size n, but q/n → 0 as n → ∞. From now on, we write g_i(β) for g(X_i, β). We also assume that the g_i(β) are independent, and we write g_i(β) rather than g_{ni}(β) to simplify the notation.

2.1 The Estimators

We first define the estimators that we use. They are designed to answer the following questions. If we have a large number of control variables, some of which may be irrelevant (and possibly also a large number of endogenous variables), in the structural equation of a simultaneous equation system, or a large number of parameters in a nonlinear system with endogenous and control variables, can we select the relevant ones and estimate the selected system simultaneously? If there may be correlation among a large number of variables, can the method handle that? Is it also possible for the estimator to achieve the oracle property? The answers to all three questions are affirmative. First of all, the adaptive elastic net estimator simultaneously selects and estimates the model when there is a large number of parameters/regressors. It can also take into account possible correlation among the variables. By achieving the oracle property, the nonzero parameters are estimated with their standard limits, and the zero ones are estimated as exactly zero. The method is computationally easy and uses data-dependent rules to set small coefficient estimates to zero. A subcase of the adaptive elastic net estimator is the adaptive lasso estimator, which can handle the first and third questions but does not handle correlation among a large number of variables.

First we introduce the notation. For a vector β we use the norms \(\|\beta\|_1 = \sum_{j=1}^{p}|\beta_j|\), \(\|\beta\|_2^2 = \sum_{j=1}^{p}|\beta_j|^2\), and \(\|\beta\|_{2+l}^{2+l} = \sum_{j=1}^{p}|\beta_j|^{2+l}\), where l > 0 is a positive number. For a matrix A, the norm is \(\|A\|_2^2 = \mathrm{tr}(A'A)\). We start by introducing the adaptive elastic net estimator, given the positive and diverging tuning parameters λ_1, λ_1^*, λ_2 (how to choose them in finite samples, and their asymptotic properties, are discussed in the assumptions and then in the Simulation section):

\[ \hat{\beta}_{aenet} = (1 + \lambda_2/n)\, \arg\min_{\beta \in B_p} \Bigg\{ \Big[\sum_{i=1}^{n} g_i(\beta)\Big]' W_n \Big[\sum_{i=1}^{n} g_i(\beta)\Big] + \lambda_2 \|\beta\|_2^2 + \lambda_1^* \sum_{j=1}^{p} \hat{w}_j |\beta_j| \Bigg\}, \tag{1} \]

where \(\hat{w}_j = 1/|\hat{\beta}_{j,enet}|^{\gamma}\), \(\hat{\beta}_{enet}\) is a consistent estimator explained immediately below, γ is a positive constant, and p = n^α, 0 < α < 1. The assumption on γ is given in detail in Assumption 3(iii). W_n is a q × q weight matrix that will be defined in the assumptions below. The elastic net estimator, which is used in the weights of the penalty above, is

\[ \hat{\beta}_{enet} = (1 + \lambda_2/n)\, \arg\min_{\beta \in B_p} S_n(\beta), \]

where

\[ S_n(\beta) = \Big[\sum_{i=1}^{n} g_i(\beta)\Big]' W_n \Big[\sum_{i=1}^{n} g_i(\beta)\Big] + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1, \tag{2} \]

and λ_1, λ_2 are positive and diverging sequences that will be defined in Assumption 5.
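To make the two objective functions concrete, the following is a minimal numerical sketch of (1) and (2) in Python. It is an illustration only: the moment function g, the data, the optimizer, and all names are placeholder assumptions rather than the authors' code, and a generic derivative-free minimizer stands in for the modified shooting algorithm actually used later in the article.

import numpy as np
from scipy.optimize import minimize

def gmm_quadratic_form(beta, g, data, W):
    """[sum_i g_i(beta)]' W [sum_i g_i(beta)] for a moment function g(x_i, beta)."""
    gsum = np.sum(np.array([g(x, beta) for x in data]), axis=0)   # q-vector of summed moments
    return gsum @ W @ gsum

def elastic_net_gmm_objective(beta, g, data, W, lam1, lam2):
    """S_n(beta) in (2): GMM quadratic form + l2 (ridge) penalty + plain l1 penalty."""
    return (gmm_quadratic_form(beta, g, data, W)
            + lam2 * np.sum(beta ** 2)
            + lam1 * np.sum(np.abs(beta)))

def adaptive_elastic_net_gmm_objective(beta, g, data, W, lam1_star, lam2, w_hat):
    """Objective inside (1): the l1 part is reweighted by w_hat_j = 1/|beta_enet_j|^gamma."""
    return (gmm_quadratic_form(beta, g, data, W)
            + lam2 * np.sum(beta ** 2)
            + lam1_star * np.sum(w_hat * np.abs(beta)))

def adaptive_elastic_net_gmm(g, data, W, lam1, lam1_star, lam2, gamma, beta_start):
    """Two stages: (i) elastic net GMM for the weights, (ii) adaptive elastic net GMM.
    Both minimizers are rescaled by (1 + lam2/n), as in (1) and the display above (2)."""
    n = len(data)
    enet = minimize(elastic_net_gmm_objective, beta_start,
                    args=(g, data, W, lam1, lam2), method="Nelder-Mead")
    beta_enet = (1 + lam2 / n) * enet.x
    w_hat = 1.0 / np.maximum(np.abs(beta_enet), 1e-8) ** gamma    # adaptive weights
    aenet = minimize(adaptive_elastic_net_gmm_objective, beta_start,
                     args=(g, data, W, lam1_star, lam2, w_hat), method="Nelder-Mead")
    return (1 + lam2 / n) * aenet.x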




We now discuss the penalty functions in both estimators and explain why we need \(\hat{\beta}_{enet}\). The elastic net estimator has both l1 and l2 penalties. The l1 penalty performs automatic variable selection, and the l2 penalty improves prediction and handles the collinearity that may arise with a large number of variables. However, the standard elastic net estimator does not have the oracle property. It turns out that, by introducing adaptive weights in the elastic net, we can obtain the oracle property. The adaptive weights play a crucial role, since they provide data-dependent penalization. An important point to remember is that when we set λ_2 = 0 in the adaptive elastic net estimator (1), we obtain the adaptive lasso estimator. This is simple, and we can also obtain the oracle property. However, with a large number of parameters/variables that may be highly collinear, an additional ridge-type penalty, as in the adaptive elastic net, offers estimation stability and better selection.

Before the assumptions we introduce the following notation. Let the collection of nonzero parameters be the set A = {j : β_{j0} ≠ 0}, and denote the absolute value of the minimum of the nonzero coefficients by η = min_{j∈A} |β_{j0}|. The cardinality of A is p_A (the number of nonzero coefficients). We now provide the main assumptions.

Assumption 1. Define the q × p matrix \(\hat{G}_n(\beta) = \sum_{i=1}^{n} \partial g_i(\beta)/\partial \beta'\). Assume the following uniform law of large numbers:

\[ \sup_{\beta \in B_p} \|\hat{G}_n(\beta)/n - G(\beta)\|_2^2 \xrightarrow{P} 0, \]

where G(β) is continuous in β and has full column rank p. Also β ∈ B_p ⊂ R^p, B_p is compact, and the absolute values of the individual components of the vector β are uniformly bounded by a constant a, 0 ≤ a < ∞, j = 1, ..., p. Note that, specifically, \(\sup_{\beta \in B_p}\|\frac{1}{n}\sum_{i=1}^{n} E\,\partial g_i(\beta)/\partial\beta' - G(\beta)\|_2^2 \to 0\) defines G(β).

Assumption 2. W_n is a symmetric, positive definite matrix, and \(\|W_n - W\|_2^2 \xrightarrow{P} 0\), where W is a finite and positive definite matrix as well.

Assumption 3.
(i) \(\|[n^{-1}\sum_{i=1}^{n} E g_i(\beta_0) g_i(\beta_0)']^{-1} - \Sigma^{-1}\|_2^2 \to 0\).
(ii) Assume p = n^α, 0 < α < 1, p/n → 0, q/n → 0 as n → ∞, and p, q → ∞ with q ≥ p, so q = n^ν, 0 < ν < 1, α ≤ ν.
(iii) The exponent γ in the weights satisfies γ > (2 + α)/(1 − α).

Assumption 4. Assume that

\[ \max_i \frac{E\|g_i(\beta_{A,0})\|_{2+l}^{2+l}}{n^{l/2}} \to 0 \]

for some l > 0, where β_{A,0} represents the true values of the nonzero parameters and is of dimension p_A. This dimension also increases with the sample size: p_A → ∞ as n → ∞, and 0 ≤ p_A ≤ p.

Assumption 5.
(i) λ_1/n → 0, λ_2/n → 0, λ_1^*/n → 0.
(ii) \(\lambda_1^* n^{\gamma(1-\alpha)}/n^{3+\alpha} \to \infty\).   (3)
(iii) \(n^{1-\nu}\eta^2 \to \infty\).   (4)
(iv) \(n^{1-\alpha}\eta^{\gamma} \to \infty\).   (5)
(v) Set η = O(n^{−m}), 0 < m < α/2.

Now we provide some discussion of the above assumptions. Most of them are standard and are used in other papers that establish asymptotic results for penalized estimation in the context of diverging parameters. The rest are typically used in GMM to deal with nonlinear equation systems with endogenous variables. Since p → ∞, Assumption 1 can be thought of as uniform convergence over the sieve spaces B_p. For the iid subcase, primitive conditions are available in condition 3.5M of Chen (2007). Assumptions 1 and 2 are standard in the GMM literature (Chen 2007; Newey and Windmeijer 2009a,b); they are similar to assumptions 7 and 3 of Newey and Windmeijer (2009a). It is also important to see that Assumption 2 is needed for the two-step nature of the GMM problem. In the first step we can use any consistent estimator (i.e., the elastic net) and substitute it into the weight built from \(n^{-1}\sum_{i=1}^{n} g_i(\hat{\beta}_{enet}) g_i(\hat{\beta}_{enet})'\) in the adaptive elastic net estimation, where \(\hat{\beta}_{enet}\) is the elastic net estimator. Also note that with different first-step estimators we can define the limit weight W differently; depending on the estimators W_n, W can change. Assumption 3 provides a definition of the variance-covariance matrix and then requires that the number of diverging parameters cannot exceed the sample size; this is also used by Zou and Zhang (2009). For the penalty exponent in the weights, our condition is more stringent than in the least squares case of Zou and Zhang (2009); this is needed for model selection of local to zero coefficients in the GMM setting. Assumption 4 is a primitive condition for the triangular array central limit theorem; it also restrains the number of orthogonality conditions q.

The main issues are the tuning parameter assumptions that reduce bias. We first compare with the Bridge estimator of Knight and Fu (2000): in their theorem 3 they need λ/n^{a/2} → λ_0 ≥ 0, where 0 < a < 1 and λ is their only tuning parameter. In our Assumption 5(i) we need λ_1/n → 0, λ_1^*/n → 0, λ_2/n → 0, so our tuning parameters can be larger than those of Knight and Fu (2000); we can penalize more in our case. This is due to the Bridge type of penalty, which requires less penalization to reduce bias and obtain the oracle property. Theorem 2 of Zou (2006) for the adaptive lasso in least squares assumes λ/n^{1/2} → 0. We think the reason the GMM estimator requires larger penalization is its more complex model structure, since there are more elements that contribute to bias here. Theorem 1 of Gao and Huang (2010) displays the same tuning analysis as Zou (2006). The rates on λ_1, λ_2 are standard, but the rate of λ_1^* depends on α and γ. The conditions on λ_1, λ_1^*, λ_2 are needed for consistency and for the bounds on the moments of the estimators. We also allow for local to zero (nonzero) coefficients, but Assumptions 3 and 5 (Equations (4) and (5)) restrict their magnitude.
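As a sketch of the two-step choice of W_n just described, the second-step weight can be built from any first-step consistent estimator, for example the elastic net GMM estimate. The helper below is illustrative (the names and the small ridge term are assumptions, not the article's code); it uses the inverse of the estimated second moment of the moments, the usual efficient two-step choice that Theorem 4 adopts with W_n = Σ̂_*^{-1}.

import numpy as np

def two_step_weight(g, data, beta_first_step, ridge=1e-8):
    """Second-step GMM weight from a first-step consistent estimator.
    Sigma_hat = n^{-1} sum_i g_i(beta) g_i(beta)'; the weight is its inverse
    (a small ridge keeps the inversion well defined)."""
    G = np.array([g(x, beta_first_step) for x in data])        # n x q matrix of moments
    sigma_hat = G.T @ G / len(data)                             # q x q second moment
    return np.linalg.inv(sigma_hat + ridge * np.eye(sigma_hat.shape[0]))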



This is tied to the possible number of nonzero coefficients, noting that α ≤ ν. If there are too many nonzero coefficients (α near 1), then for model selection purposes the coefficients should approach zero only slowly. If there are few nonzero coefficients (α near 0, say), then the order of η should be slightly larger than n^{−1/2}. This also confirms and extends the finding of Leeb and Pötscher (2005) that local to zero coefficients should be larger than n^{−1/2} in order to be differentiated from zero coefficients; this is shown in proposition A.1(2) of their article. Our results extend that result to the diverging parameter case. Assumption 5(iii) is needed to get consistency for local to zero coefficients. Assumptions 5(iv) and (v) are needed for model selection consistency of local to zero coefficients. To be specific about the implications of Assumptions 5(iii), (iv), and (v) for the order of η: since η = O(n^{−m}), Assumption 5(iii) implies that

\[ \frac{1-\nu}{2} > m, \]

and Assumption 5(iv) implies that

\[ \frac{1-\alpha}{\gamma} > m. \]

Combining these two inequalities with Assumption 5(v), we obtain

\[ m < \min\Big\{ \frac{1-\alpha}{\gamma},\; \frac{1-\nu}{2},\; \frac{\alpha}{2} \Big\}. \]

Now we can see that with a large number of moments, or a large number of parameters, m may get small, so the magnitude of η must be large. To give an example, take γ = 5, α = 1/3, ν = 2/3, which gives the upper bound m < 2/15. So in that scenario a coefficient must be of order η = O(n^{−2/15}) to get selected as nonzero. This is clearly much larger than the n^{−1/2} bound found by Leeb and Pötscher (2005). Note also that with γ > (2 + α)/(1 − α) (i.e., Assumption 3(iii)), we can ensure that the conditions on λ_1^* in Assumptions 5(i) and (ii) are compatible with each other.

Using Assumptions 1, 2, and 3(i), we can see that

\[ 0 < b \le \mathrm{Eigmin}\big[(\hat{G}_n(\bar{\beta})'/n)\, W_n\, (\hat{G}_n(\bar{\beta})/n)\big], \tag{6} \]

and

\[ \mathrm{Eigmax}\big[(\hat{G}_n(\bar{\beta})'/n)\, W_n\, (\hat{G}_n(\bar{\beta})/n)\big] \le B < \infty, \tag{7} \]

with probability approaching 1, where \(\bar{\beta} \in [\beta_0, \hat{\beta}_w)\) and B > 0. These bounds are obtained from exercise 8.26b of Abadir and Magnus (2005) and lemma A0 of Newey and Windmeijer (2009b), an eigenvalue inequality for the case of increasing dimension. The estimator \(\hat{\beta}_w\) is related to the adaptive elastic net estimator and is defined immediately below. Here Eigmin(M) and Eigmax(M) denote, respectively, the minimal and maximal eigenvalues of a generic matrix M.

3. ASYMPTOTICS

We define an estimator that is related to the adaptive elastic net estimator in (1) and is also used in the risk bound calculations:

\[ \hat{\beta}_w = \arg\min_{\beta \in B_p} \Bigg\{ \Big[\sum_{i=1}^{n} g_i(\beta)\Big]' W_n \Big[\sum_{i=1}^{n} g_i(\beta)\Big] + \lambda_2 \|\beta\|_2^2 + \lambda_1 \sum_{j=1}^{p} \hat{w}_j |\beta_j| \Bigg\}. \tag{8} \]

The following theorem provides consistency for both the elastic net estimator and \(\hat{\beta}_w\).

Theorem 1. Under Assumptions 1–3 and 5,
(i) \(\|\hat{\beta}_{enet} - \beta_0\|_2^2 \xrightarrow{P} 0\);
(ii) \(\|\hat{\beta}_w - \beta_0\|_2^2 \xrightarrow{P} 0\).

Remark 1. It is clear from Theorem 1(ii) that the adaptive elastic net estimator in (1) is also consistent. We should note that in the article by Zou and Zhang (2009), where the least squares adaptive elastic net estimator is studied, there is no explicit consistency proof; this is due to their use of a simple linear model. However, for the GMM adaptive elastic net estimator the partial derivative of g(·) depends on the estimators, unlike in the linear model case (specifics are in Equations (A.31)–(A.36)). Therefore, we need a new and different consistency proof compared with the least squares case.

We need to introduce an estimator that is closely tied to the elastic net estimator above:

\[ \hat{\beta}(\lambda_2, \lambda_1) = \arg\min_{\beta \in B_p} S_n(\beta), \tag{9} \]

where S_n(β) is defined in (2). This is also the estimator we get when we set \(\hat{w}_j = 1\) for all j in \(\hat{\beta}_w\). Next, we provide bounds for our estimators; these are then used in the proofs of the oracle property and of the limits of the estimators.

Theorem 2. Under Assumptions 1–3 and 5,

\[ E\|\hat{\beta}_w - \beta_0\|_2^2 \le 4\, \frac{\lambda_2^2 \|\beta_0\|_2^2 + n^3 p B + \lambda_1^2 E\sum_{j=1}^{p} \hat{w}_j^2 + o(n^2)}{[n^2 b + \lambda_2 + o(n^2)]^2}, \]

and

\[ E\|\hat{\beta}(\lambda_2, \lambda_1) - \beta_0\|_2^2 \le 4\, \frac{\lambda_2^2 \|\beta_0\|_2^2 + n^3 p B + \lambda_1^2 p + o(n^2)}{[n^2 b + \lambda_2 + o(n^2)]^2}. \]

Remark 2. Note that the first bound is for the estimator in (8) and the second bound is for the estimator in (9); \(\hat{\beta}_w\) is related to the adaptive elastic net estimator in (1), and \(\hat{\beta}(\lambda_2,\lambda_1)\) is related to the estimator in (2). Even though \(\|\beta_0\|_2^2 = O(p)\) and p → ∞, the bound is well behaved because \(\lambda_2^2\|\beta_0\|_2^2/n^2 \to 0\) in large samples by Assumptions 3(ii) and 5. Also, \(\lambda_1^2 E\sum_{j=1}^{p}\hat{w}_j^2\) is dominated by the n^4 in the denominator in large samples, as seen in



the proof of Theorem 3(i). It is clear from the last result that the elastic net estimator converges at the rate \(\sqrt{n/p}\). Theorem 2 extends the least squares case of theorem 3(i) by Zou and Zhang (2009) to the nonlinear GMM case. The risk bounds differ from theirs due to the nonlinear nature of our problem: the partial derivative of the sample moment depends on parameter estimates in our case, which complicates the proofs.

Write \(\beta_0 = (\beta_{A,0}', 0_{p-p_A}')'\), where β_{A,0} represents the vector of nonzero parameters (true values). Its dimension grows with the sample size, and the vector \(0_{p-p_A}\) of p − p_A elements represents the zero (redundant) parameters. Let β_A represent the nonzero parameters, of dimension p_A × 1. Then define

\[ \tilde{\beta} = \arg\min_{\beta_A} \Bigg\{ \Big[\sum_{i=1}^{n} g_i(\beta_A)\Big]' W_n \Big[\sum_{i=1}^{n} g_i(\beta_A)\Big] + \lambda_2 \sum_{j \in A} \beta_j^2 + \lambda_1^* \sum_{j \in A} \hat{w}_j |\beta_j| \Bigg\}, \]

where A = {j : β_{j0} ≠ 0, j = 1, 2, ..., p}. Our next goal is to show that, with probability tending to 1, \([(1 + \lambda_2/n)\tilde{\beta}, 0_{p-p_A}]\) converges to the solution of the adaptive elastic net estimator in (1).

Theorem 3. Given Assumptions 1–3 and 5,
(i) with probability tending to 1, \(((1 + \frac{\lambda_2}{n})\tilde{\beta}, 0)\) is the solution to (1);
(ii) (Consistency in Selection) we also have \(P(\{j : \hat{\beta}_{aenet,j} \ne 0\} = A) \to 1\).

Remark 3. 1. Theorem 3(i) shows that the ideal estimator \(\tilde{\beta}\) becomes the same as the adaptive elastic net estimator in large samples, so the GMM adaptive elastic net estimator has the same solution as \(((1 + \lambda_2/n)\tilde{\beta}, 0_{p-p_A})\). Theorem 3(ii), together with Theorem 4, shows that the nonzero adaptive elastic net estimates display the oracle property; this is a sharper result than Theorem 3(i). It is an important extension of the least squares case of theorems 3.2 and 3.3 by Zou and Zhang (2009) to GMM estimation.
2. We allow for local to zero parameters and also provide an assumption on when they may be considered nonzero. This is Assumption 5(iii) and (iv): \(n^{1-\nu}\eta^2 \to \infty\), \(n^{1-\alpha}\eta^{\gamma} \to \infty\), where q = n^ν, p = n^α, 0 < α ≤ ν < 1. The implications of these assumptions for the magnitude of the smallest nonzero coefficient were discussed after the assumptions. The proof of Theorem 3(ii) clearly shows that, as long as Assumption 5 is satisfied, model selection for local to zero coefficients is possible. However, local to zero coefficients cannot be arbitrarily close to zero and still be selected. This is well established by Leeb and Pötscher (2005), who showed in their proposition A.1(2) that local to zero coefficients can be selected as long as their order is larger than n^{−1/2} in magnitude; this acts as a lower bound for nonzero coefficients to be distinguished from zero coefficients. Our Assumption 5 is the extension of their result to the GMM estimator with a diverging number of parameters. In the diverging parameter case, there is a tradeoff between the number of local to zero coefficients and the requirement on the order of their magnitude.

Now we provide the limit law for the estimates of the nonzero (true) parameter values. Denote the adaptive elastic net estimators that correspond to the nonzero true parameter values by the vector \(\hat{\beta}_{aenet,A}\), which is of dimension p_A × 1. Define \(\hat{\Sigma}_*\) as a consistent variance estimator for the nonzero parameters that can be derived from the elastic net estimators, and define \(\Sigma_*^{-1}\) by \(\|[n^{-1}\sum_{i=1}^{n} E g_i(\beta_{A,0}) g_i(\beta_{A,0})']^{-1} - \Sigma_*^{-1}\|_2^2 \to 0\).

Theorem 4. Under Assumptions 1–5, given \(W_n = \hat{\Sigma}_*^{-1}\), set \(W = \Sigma_*^{-1}\); then

\[ \delta'\, K_n \big[\hat{G}(\hat{\beta}_{aenet,A})' \hat{\Sigma}_*^{-1} \hat{G}(\hat{\beta}_{aenet,A})\big]^{1/2} n^{-1/2} (\hat{\beta}_{aenet,A} - \beta_{A,0}) \xrightarrow{d} N(0,1), \]

where \(K_n = \big[I_{p_A} + \lambda_2 \big(\hat{G}(\hat{\beta}_{aenet,A})' \hat{\Sigma}_*^{-1} \hat{G}(\hat{\beta}_{aenet,A})\big)^{-1}\big]/(1 + \lambda_2/n)\) is a square matrix of dimension p_A and δ is a vector of Euclidean norm 1.

Remark 4. 1. First we see that \(\|K_n - I_{p_A}\|_2^2 \xrightarrow{P} 0\), due to Assumptions 1, 2, and λ_2 = o(n).
2. This theorem clearly extends those by Zou and Zhang (2009) from the least squares case to GMM estimation. The result generalizes theirs to nonlinear functions of endogenous variables, which are heavily used in econometrics and finance. The extension is not straightforward, since the new limit result depends on an explicit separate consistency proof, unlike the least squares case by Zou and Zhang (2009). This is mainly because the partial derivative of the sample moment function depends on the parameter estimates, which is not shared by the least squares estimator. The limit that we derive also corresponds to the standard GMM limit by Hansen (1982), where the same result was derived for a fixed number of parameters with a well-specified model. In this way, Theorem 4 also generalizes the result by Hansen (1982) in the direction of a large number of parameters with model selection.
3. Note that the K_n term is a ridge-regression-like term that helps to handle the collinearity among the variables.
4. Note that if we set λ_2 = 0, we obtain the limit for the adaptive lasso GMM estimator. In that case \(K_n = I_{p_A}\) and

\[ \delta' \big[\hat{G}(\hat{\beta}_{alasso,A})' \hat{\Sigma}_*^{-1} \hat{G}(\hat{\beta}_{alasso,A})\big]^{1/2} n^{-1/2} (\hat{\beta}_{alasso,A} - \beta_{A,0}) \xrightarrow{d} N(0,1). \]

How to choose the tuning parameters λ_1, λ_2, λ_1^*, and how to set small parameter estimates to zero in finite samples, is discussed in the Simulation section.
5. Instead of the Liapounov central limit theorem, we can use a central limit theorem for stationary time series data; such results already exist in the book by Davidson (1994). Theorem 4 then proceeds as in the independent data case. When defining the GMM objective function, one uses sample moments weighted in time; we conjecture that this results in the same proofs for Theorems 1–3. This technique of weighting sample moments by time is used by Otsu (2006) and Guggenberger and Smith (2008).


6. After obtaining the adaptive elastic net GMM results, one can run the unpenalized GMM with the nonzero parameters and conduct inference.
7. First, from item 1 of this remark we have \(\|K_n - I_{p_A}\|_2^2 = o_p(1)\). Then \(\|\delta\|_2 = (\delta_1^2 + \cdots + \delta_{p_A}^2)^{1/2} = 1\), and δ is a p_A-vector. Then, by Assumption 1 and the consistency of the adaptive elastic net, we have \(\|\hat{G}(\hat{\beta}_{aenet,A})\|_2 = O_p(n^{1/2})\). These provide the rate \(\sqrt{n/p_A}\) for the adaptive elastic net estimators.
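As a sketch of the inference route in item 6, once the adaptive elastic net has fixed the zero set, standard errors for the retained coefficients can be computed from the usual efficient-GMM variance formula underlying Theorem 4 (with K_n close to the identity). The helper below is illustrative only; its names and the Jacobian argument are assumptions, not the authors' code.

import numpy as np

def post_selection_gmm_se(g, jac_g, data, beta_hat, nonzero):
    """Standard errors for the selected (nonzero) coefficients.
    jac_g(x, beta) returns the q x p Jacobian of the moment function;
    nonzero is a boolean mask of the coefficients kept by the adaptive elastic net."""
    n = len(data)
    idx = np.flatnonzero(nonzero)
    G_bar = np.mean([jac_g(x, beta_hat)[:, idx] for x in data], axis=0)   # q x p_A average Jacobian
    moments = np.array([g(x, beta_hat) for x in data])                    # n x q moments
    sigma_hat = moments.T @ moments / n                                   # q x q second moment
    bread = G_bar.T @ np.linalg.inv(sigma_hat) @ G_bar                    # p_A x p_A
    avar = np.linalg.inv(bread)                                           # asymptotic variance
    return np.sqrt(np.diag(avar) / n)                                     # standard errors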

4. SIMULATION

In this section we analyze the finite sample properties of the adaptive elastic net estimator for GMM. Namely, we evaluate its bias and root mean squared error (RMSE), as well as how often redundant and relevant parameters are correctly identified. We have the following simultaneous equations for all i = 1, ..., n:

\[ y_i = x_i'\beta_0 + \epsilon_i, \qquad x_i = \pi z_i + \eta_i, \qquad \epsilon_i = \rho\, \iota'\eta_i + \sqrt{1-\rho^2}\, \iota' v_i, \]

where the number of instruments q is set equal to the number of parameters p, x_i is a p × 1 vector, z_i is a p × 1 vector, ρ = 0.5, and π is a square matrix of dimension p. Furthermore, η_i is iid N(0, I_p), v_i is iid N(0, I_p), and ι is a p × 1 vector of ones. The estimated model is based on the moment conditions

\[ E\, z_i \epsilon_i = 0, \quad \text{for all } i = 1, \ldots, n. \]

We have two different designs for the parameter vector β_0. In the first case β_0 = (3, 3, 0, 0, 0)' (Design 1), and in the second β_0 = (3, 3, 3, 3, 0)' (Design 2). We set n = 100 and z_i ∼ N(0, Σ_z) for all i = 1, ..., n, with

\[ \Sigma_z = \begin{pmatrix} 1 & 0.5 & 0 & 0 & 0 \\ 0.5 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}. \]

So there is correlation between the z_i's, and this affects the correlation between the x_i's, since the two equations are correlated. In this section, we compare three methods: GMM-BIC by Andrews and Lu (2001), Bridge-GMM by Caner (2009), and the adaptive elastic net GMM. We use four different measures to compare them. First, we look at the percentage of correct models selected. Then we evaluate the following summary MSE:

\[ E[(\hat{\beta} - \beta_0)' \Sigma (\hat{\beta} - \beta_0)], \tag{10} \]

where Σ = E x_i x_i' and \(\hat{\beta}\) represents the estimated coefficient vector given by each of the three methods. This measure is commonly used in the statistics literature (see Zou and Zhang 2009). The other two measures concern the individual coefficients: the bias of each individual coefficient estimate, and the root MSE of each coefficient. We use 10,000 iterations.

Small coefficient estimates are truncated to zero via \(|\hat{\beta}_{Bridge}| < 2/\lambda\) for Bridge-GMM, as suggested by Caner (2009). For the adaptive elastic net, we use the modified shooting algorithm given in appendix 2 of Zhang and Lu (2007). Least angle regression (LAR) is not used because it is not clear whether it is useful in the GMM context. The modified shooting algorithm amounts to using the Kuhn-Tucker conditions for a corner solution. First, the absolute value of the partial derivative of the (unpenalized) GMM objective with respect to the parameter of interest is evaluated at zero for that parameter and at the current adaptive elastic net estimates for the rest. If this is less than \(\lambda_1^*/|\hat{\beta}_{enet}|^{4.5}\), then we set that parameter to zero. We have also tried slightly larger exponents than 4.5 and observed that the results are not affected much; the reason for a large γ comes from Assumption 3(iii). This is similar to the adaptive lasso case in Zhang and Lu (2007).

The choice of the λ's in both Bridge-GMM and the adaptive elastic net GMM is done via BIC, as suggested by Zou, Hastie, and Tibshirani (2007) as well as by Wang, Li, and Tsai (2007). Specifically, we use the following BIC by Wang, Li, and Leng (2009). For each pair \(\lambda_s = (\lambda_1^*, \lambda_2) \in \Omega\),

\[ \mathrm{BIC}(\lambda_s) = \log(\mathrm{SSE}) + |A|\, \frac{\log n}{n}, \]

where |A| is the cardinality of the set A and \(\mathrm{SSE} = [n^{-1}\sum_{i=1}^{n} g_i(\hat{\beta})]' W_n [n^{-1}\sum_{i=1}^{n} g_i(\hat{\beta})]\). Basically, given a specific λ_s, we determine how many nonzero coefficients there are in the estimator and use this to calculate the cardinality of A, and for that choice compute SSE. The final λ is chosen as

\[ \hat{\lambda}_s = \arg\min_{\lambda_s \in \Omega} \mathrm{BIC}(\lambda_s), \]

where Ω represents a finite set of possible values of λ_s.

The Bridge-GMM estimator of Caner (2009) is the \(\hat{\beta}\) that minimizes U_n(β), where

\[ U_n(\beta) = \Big[\sum_{i=1}^{n} g_i(\beta)\Big]' W_n \Big[\sum_{i=1}^{n} g_i(\beta)\Big] + \lambda \sum_{j=1}^{p} |\beta_j|^{\gamma}, \tag{11} \]

for a given positive regularization parameter λ and 0 < γ < 1.

We now describe model selection by GMM-BIC as proposed by Andrews and Lu (2001). Let b ∈ R^p denote a model selection vector. By definition, each element of b is either zero or one. If the jth element of b is one, the corresponding β_j is to be estimated; if the jth element of b is zero, we set β_j to zero. We let |b| denote the number of parameters to be estimated, or equivalently \(|b| = \sum_{j=1}^{p} |b_j|\). We then set \(\beta_{[b]}\) as the p × 1 vector representing the element-by-element (Hadamard) product of β and b. The model selection is based on the GMM objective function and a penalty term. The objective function in BIC is

\[ J_n(b) = \Big[\sum_{i=1}^{n} g_i(\beta_{[b]})\Big]' W_n \Big[\sum_{i=1}^{n} g_i(\beta_{[b]})\Big], \tag{12} \]

where in the simulation \(g_i(\beta_{[b]}) = z_i(y_i - x_i'\beta_{[b]})\). The model selection vectors "b" in our case represent 31 different possibilities (excluding the all-zero case). The following are the possibilities for all "b" vectors:

\[ M = [M_{11}, M_{12}, M_{13}, M_{14}, M_{15}], \]



Table 1. Success percentages of selecting the correct model

Estimators              Design 1    Design 2
Adaptive Elastic Net        91.2        94.9
Bridge-GMM                 100.0       100.0
GMM-BIC                      6.9         0.0

NOTE: The GMM-BIC (Andrews and Lu 2001) represents the models that are selected according to BIC, after which we use GMM. The Bridge-GMM estimator is studied by Caner (2009). The Adaptive Elastic Net estimator is the new procedure proposed in this study.

where M_{11} is the identity matrix of dimension 5, I_5, which represents all the possibilities with only one nonzero coefficient. M_{12} represents all the possibilities with two nonzero coefficients,

\[ M_{12} = \begin{pmatrix} 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 1 \end{pmatrix}. \tag{13} \]

In the same way, M_{13} represents all possibilities with three nonzero coefficients, M_{14} represents all possibilities with four nonzero coefficients, and M_{15} is the vector of ones, representing all coefficients nonzero. The true model in Design 1 is the first column vector of M_{12}; for Design 2, the true model is in M_{14} and is (1, 1, 1, 1, 0)'. The GMM-BIC selects the model by minimizing the following criterion among the 31 possibilities:

\[ J_n(b) + |b| \log(n). \tag{14} \]
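As a concrete illustration of the GMM-BIC search in (14), a minimal Python sketch follows. Here jn is a placeholder for a routine that returns the objective (12) for the model restricted by b; all names are illustrative assumptions, not the authors' code.

import numpy as np
from itertools import product

def gmm_bic_select(jn, p, n):
    """Enumerate all 0/1 selection vectors b of length p (excluding the all-zero case;
    for p = 5 this gives 31 models) and pick the one minimizing J_n(b) + |b| log(n)."""
    best_b, best_crit = None, np.inf
    for b in product([0, 1], repeat=p):
        if sum(b) == 0:
            continue                                  # exclude the all-zero model
        crit = jn(np.array(b)) + sum(b) * np.log(n)   # criterion (14)
        if crit < best_crit:
            best_b, best_crit = np.array(b), crit
    return best_b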

The penalty term penalizes larger models more. Denote the optimal model selection vector by b*. After selecting the optimal model via (14), we estimate the parameters of the selected model by GMM.

Next we present the results, in Tables 1–4, for the three techniques examined in this simulation. In Table 1, we report the correct model selection percentages for Designs 1 and 2. We see that both Bridge-GMM and the Adaptive Elastic Net do very well: Bridge-GMM selects the correct model 100% of the time and the Adaptive Elastic Net 91%–95% of the time, whereas GMM-BIC selects the correct model only 0%–6.9% of the time. This is due to the large number of candidate models in the case of GMM-BIC; with a large number of parameters the performance of GMM-BIC tends to deteriorate. Table 2 provides the summary MSE results. These clearly show that the Adaptive Elastic Net estimator is the best among the three, since its MSE figures are the smallest. The GMM-BIC is much worse in terms of MSE, due to its wrong model selection, and

Table 3. Bias and RMSE results of Design 1

        Adaptive Elastic Net      Bridge-GMM           GMM-BIC
        BIAS       RMSE           BIAS      RMSE       BIAS       RMSE
β1     −0.244      0.272         −0.117     0.126      2.903     159.85
β2     −0.244      0.272         −0.667     0.669     −4.082     261.32
β3      0.013      0.042          0.000     0.000     −0.859     158.839
β4      0.000      0.009          0.000     0.000      0.612     188.510
β5      0.013      0.041          0.000     0.000      1.162      62.240

NOTE: The GMM-BIC (Andrews and Lu 2001) represents models that are selected according to BIC, after which we use GMM. The Bridge-GMM estimator is studied by Caner (2009). The Adaptive Elastic Net estimator is the new procedure proposed in this study.

after the model selection, it estimates the zero coefficients with nonzero and large magnitudes.

Tables 3 and 4 provide the bias and root MSE of each coefficient in Designs 1 and 2. Comparing Bridge-GMM with the Adaptive Elastic Net, we observe that the biases of the nonzero coefficients are generally smaller for the Adaptive Elastic Net. The same generally holds for the root MSEs, which are smaller for the nonzero coefficients under the Adaptive Elastic Net estimator. To get confidence intervals for the nonzero parameters, one can run the adaptive elastic net first and find the zero and nonzero coefficients; for the nonzero estimates we then have the standard GMM standard errors by Theorem 4, from which confidence intervals for the nonzero coefficients can be calculated.
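Before turning to the application, here is a minimal sketch of one draw from the simultaneous-equations design above (Design 1 defaults). The text pins down π only as a p × p square matrix, so the identity is used here purely as an illustrative assumption.

import numpy as np

def simulate_design(n=100, beta0=(3, 3, 0, 0, 0), rho=0.5, seed=0):
    """One draw from the Monte Carlo design of Section 4."""
    rng = np.random.default_rng(seed)
    beta0 = np.asarray(beta0, dtype=float)
    p = beta0.size
    sigma_z = np.eye(p)
    sigma_z[0, 1] = sigma_z[1, 0] = 0.5                  # correlation between z_1 and z_2
    z = rng.multivariate_normal(np.zeros(p), sigma_z, size=n)
    eta = rng.standard_normal((n, p))
    v = rng.standard_normal((n, p))
    pi = np.eye(p)                                        # illustrative choice of pi
    x = z @ pi.T + eta                                    # x_i = pi z_i + eta_i
    iota = np.ones(p)
    eps = rho * (eta @ iota) + np.sqrt(1 - rho ** 2) * (v @ iota)
    y = x @ beta0 + eps                                   # structural equation
    return y, x, z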

Table 2. Summary mean squared error (MSE)

Estimators              Design 1    Design 2
Adaptive Elastic Net         1.8         1.3
Bridge-GMM                   4.2         1.3
GMM-BIC                 165848.5    876080.2

NOTE: The MSE formula is given in (10); instead of expectations, the average over iterations is used. A small summary MSE is desirable for a model. The GMM-BIC (Andrews and Lu 2001) represents models that are selected according to BIC, after which we use GMM. The Bridge-GMM estimator is studied by Caner (2009). The Adaptive Elastic Net is the new procedure proposed in this study.

5. APPLICATION

In this part, we go through a useful application of the new estimator. The following is the external habit specification model considered by Chen and Ludvigson (2009) (also Chen 2007, equation (2.7)):

\[ E\left[ \iota_0 \left(\frac{C_{t+1}}{C_t}\right)^{-\phi_0} \frac{\big(1 - h_0(C_t/C_{t+1})\big)^{-\phi_0}}{\big(1 - h_0(C_{t-1}/C_t)\big)^{-\phi_0}}\, R_{l,t+1} - 1 \,\Big|\, z_t \right] = 0, \]

where C_t represents consumption at time t, and ι_0 and φ_0 are both positive; they represent the time discount and the curvature of the utility function, respectively.

Table 4. Bias and RMSE results of Design 2

        Adaptive Elastic Net      Bridge-GMM           GMM-BIC
        BIAS       RMSE           BIAS      RMSE       BIAS       RMSE
β1     −0.181      0.193         −0.112     0.124     −0.805     158.171
β2     −0.181      0.193         −0.662     0.665     −0.326     112.970
β3      0.010      0.061          0.157     0.166     −0.314     120.358
β4     −0.038      0.071          0.337     0.341     −6.759     659.673
β5     −0.001      0.007          0.000     0.000      7.740     617.509

NOTE: The GMM-BIC (Andrews and Lu 2001) represents models that are selected according to BIC, after which we use GMM. The Bridge-GMM estimator is studied by Caner (2009). The Adaptive Elastic Net estimator is the new procedure proposed in this study.


R_{l,t+1} is the lth asset return at time t + 1, h_0(·) ∈ [0, 1) is an unknown habit formation function, and z_t is the information set, which will be linked to valid instruments. We take only one lag of the consumption ratio, rather than several; the possibility of this specific model is mentioned by Chen and Ludvigson (2009, p. 1069). Chen and Ludvigson (2009) used sieve estimation to estimate the unknown function h_0, with the dimension of the sieve set to a given number. In this article, we use the adaptive elastic net GMM to select the sieve dimension automatically and estimate the structural parameters at the same time. The parameters and the unknown habit function that we try to estimate are ι_0, φ_0, and h_0. Now denote


\[ \rho(C_t, R_{l,t+1}, \iota_0, \phi_0, h_0) = \iota_0 \left(\frac{C_{t+1}}{C_t}\right)^{-\phi_0} \frac{\big(1 - h_0(C_t/C_{t+1})\big)^{-\phi_0}}{\big(1 - h_0(C_{t-1}/C_t)\big)^{-\phi_0}}\, R_{l,t+1} - 1. \]

Before setting up the orthogonality restrictions, let s_{0j}(z_t) be a sequence of known basis functions that can approximate any square integrable function. Then for each l = 1, ..., N and j = 1, ..., J_T, the restrictions are

\[ E[\rho(C_t, R_{l,t+1}, \iota_0, \phi_0, h_0)\, s_{0j}(z_t)] = 0. \]

In total we have N J_T restrictions, where N is fixed but J_T → ∞ and N J_T / T → 0 as T → ∞. The main issue is the approximation of the unknown function h_0. Chen and Ludvigson (2009) used sieves to approximate this function; in theory the sieve dimension K_T → ∞ but K_T / T → 0 as T → ∞. Like Chen and Ludvigson (2009), we use an artificial neural network sieve approximation

\[ h\!\left(\frac{C_{t-1}}{C_t}\right) = \zeta_0 + \sum_{j=1}^{K_T} \zeta_j\, \Psi\!\left(\tau_j \frac{C_{t-1}}{C_t} + \kappa_j\right), \]

where Ψ(·) is an activation function, chosen here as the logistic function Ψ(x) = (1 + e^{-x})^{-1}. This implies that to estimate the habit function we need 3K_T + 1 parameters: ζ_0, ζ_j, τ_j, κ_j, j = 1, ..., K_T. Chen and Ludvigson (2009) used K_T = 3. In our article, along with the parameters ι_0 and φ_0, the estimation of h_0 proceeds through selection of the correct sieve dimension. So if the true sieve dimension is K_{T0}, with 0 ≤ K_{T0} ≤ K_T, then the adaptive elastic net GMM aims to estimate that dimension (through estimation of the parameters in the habit function). The total number of parameters to be estimated is p = 3(K_T + 1), since we also estimate ι_0 and φ_0 in addition to the habit function parameters. The number of orthogonality restrictions is q = N J_T, and we assume q = N J_T ≥ 3(K_T + 1) = p. Chen and Ludvigson (2009) used a sieve minimum distance estimator, and Chen (2007) used a sieve GMM estimator, to estimate the parameters; specifically, equation (2.16) of Chen (2007) uses unpenalized sieve GMM estimation for this problem. Instead, we assume that the true dimension of the sieve is unknown and estimate the structural parameters along with the habit function (the parameters in that function) by the adaptive elastic net GMM estimator. Set β = (ι, φ, h), with the compact sieve space B_p = B_ι × B_φ × H_T. The compactness assumption is discussed by Chen and Ludvigson (2009, p. 1067); it is mainly needed so that the sieve parameters do not generate tail observations on Ψ(·).
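A minimal sketch of this artificial neural network sieve follows; the function and argument names are illustrative assumptions, not the authors' code.

import numpy as np

def logistic(x):
    """Activation Psi(x) = (1 + exp(-x))^{-1} used in the sieve."""
    return 1.0 / (1.0 + np.exp(-x))

def habit_sieve(c_ratio, zeta0, zeta, tau, kappa):
    """ANN sieve for the habit function:
    h(C_{t-1}/C_t) = zeta0 + sum_j zeta_j * Psi(tau_j * C_{t-1}/C_t + kappa_j),
    with K_T = len(zeta) hidden terms; parameter names mirror the text."""
    c_ratio = np.asarray(c_ratio)
    terms = logistic(np.outer(c_ratio, tau) + kappa)     # T x K_T matrix of activations
    return zeta0 + terms @ np.asarray(zeta)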


Also set the approximating known basis functions s(z_t) = (s_{0,1}(z_t), ..., s_{0,J_T}(z_t))', a J_T × 1 vector. Here J_T = 3: there are three instruments, namely a constant, lagged consumption growth, and its square.¹ There are seven asset returns used in the study, so N = 7. Detailed explanations can be found in the article by Chen and Ludvigson (2009).

Implementation details:

1. First, we run the elastic net GMM to obtain the adaptive weights ŵ_j. The elastic net GMM has the same objective function as the adaptive version but with w_j = 1 for all j; the enet-GMM estimator is obtained by setting the weights to 1 in the estimator of the third step below.

2. Then, for the weights, since a priori it is known that the nonzero coefficients cannot be large positive numbers, we use γ = 2.5 in the exponent; that is, ŵ_j = 1/|β̂_{enet,j}|^{2.5} for all j. We also experimented with γ = 4.5 as in the simulations; the results were mildly different but qualitatively very similar, so they are not reported.

3. Our adaptive elastic net GMM estimator is

\[ \hat{\beta} = (1 + \lambda_2/T)\, \arg\min_{\beta \in B_p} \Bigg\{ \Big[\sum_{t=1}^{T} \rho(C_t, R_{t+1}, \beta) \otimes s(z_t)\Big]' \hat{W} \Big[\sum_{t=1}^{T} \rho(C_t, R_{t+1}, \beta) \otimes s(z_t)\Big] + \lambda_1^* \sum_{j=1}^{3(K_T+1)} \hat{w}_j |\beta_j| + \lambda_2 \sum_{j=1}^{3(K_T+1)} \beta_j^2 \Bigg\}, \]

where β1 = ι, β2 = φ, and the remaining 3KT + 1 parameters correspond to the habit function estimation by sieves. We use the following weight to make the comparison with Chen and Ludvigson (2009) in a fair way: Wˆ = I ⊗ (S  S)− , where S is (s(z1 ), . . . , s(zt ), . . . , s(zT )) , which is T × 3 matrix, where we use Moore-Penrose inverse as described in (2.16) by Chen (2007). Note that ρ(Ct , Rl,t+1 , β) is an N × 1 vector, and ρ(Ct , Rt+1 , β) = (ρ(Ct , R1,t+1 , β), . . . , ρ(Ct , Rl,t+1 , β), . . . , ρ(Ct , RN,t+1 , β)) . After the implementation steps we explain the data here. The data points start from the second quarter of 1953 and end at the second quarter of 2001. This is a slightly shorter span than Chen and Ludvigson (2009) since we did not want to use missing data cells for certain variables. So N JT = 21 (number of orthogonality restrictions). At first we try KT = 3 like Chen and Ludvigson (2009), but this is the maximal number of sieve dimension, and we estimate sieve parameter with structural parameters and select the model unlike Chen and Ludvigson (2009). We also try KT = 5. When KT = 3, the total number of parameters to be 1There

are more instruments that are used by Chen and Ludvigson (2009), but only these three are available to us. Also we thank Sydney Ludvigson to remind us the discrepancy in the unused instruments in her website and the Journal of Applied Econometrics website.



estimated is 12, and if KT = 5, then this number is 18. We also use BIC to choose from three possible tuning parameter choices, λ1 = λ∗1 = λ2 = {0.000001, 1, 10}. The tuning parameters are taking the same value for ease of computation. So here we will compare our results with unpenalized sieve GMM by Chen (2007). As discussed above we apply a certain subset of the instruments, since the remainder are unavailable, and do not use missing data in the article by Chen and Ludvigson (2009). So our results corresponding to unpenalized sieve GMM will be slightly different than that by Chen and Ludvigson (2009). We provide the estimates for our adaptive elastic net GMM method. This is for the case of KT = 5. The time discount estimate ιˆ = 0.88 and the curvature of the utility curve parameter is φˆ = 0.66. The sieve parameter estimates are ζˆ0 = 0, ζˆj = 0, 0, 0, 0, 0, τˆj = 0.086, 0.078, 0.084, 0.073, 0.083, κˆj = 0.082, 0.072, 0.087, 0.086, 0.080, for j = 1, 2, 3, 4, 5, respectively. To compare if we use sieve GMM with imposing KT = 5 as the true dimension of the sieve, we get ιˆsg = 0.86 and φˆ sg = 0.73 for time discount and curvature parameter, where subscript sg denotes Chen and Ludvigson (2009) and Chen (2007) with λ1 = λ∗1 = λ2 = 0. So the results are basically the same as ours for these two parameters. However, the estimates of sieve parameters in unpenalized sieve GMM case is ζˆsg,0 = 0, ζˆsg,j = 0 for all j = 1, . . . , 5, and τˆsg,j = 0.083, 0.079, 0.079, 0.079, 0.076, κˆ sg,j = 0.089, 0.086, 0.086, 0.084, 0.082, respectively, for j = 1, . . . , 5. So the habit function in the sieve GMM with KT = 5 is estimated as zero (on the boundary); our method gives the same result. Chen and Ludvigson (2009) fit KT = 3 for the sieve. We provide the estimates for our adaptive elastic net GMM method in this case as well. We also reestimate Chen and Ludvigson (2009). In the adaptive elastic net GMM, the time discount estimate is ιˆ = 0.93 and the curvature of the utility curve parameter is φˆ = 0.64. The sieve parameter estimates are ζˆ0 = 0, ζˆj = 0, for j = 1, 2, 3 τˆj = 0.057, 0.054, 0.064, κˆj = 0.067, 0.066, 0.058, for j = 1, 2, 3, respectively. To compare with our method we use the sieve GMM by Chen and Ludvigson (2009) with imposing KT = 3 as the true dimension of the sieve, and get ιˆsg = 0.94 and φˆ sg = 0.71 for time discount and curvature parameter, where subscript sg denotes Chen and Ludvigson (2009) and Chen (2007) with λ1 = λ∗1 = λ2 = 0. So the results are again basically the same as ours for these two parameters. However, the estimates of sieve parameters in unpenalized sieve GMM case is ζˆsg,0 = 0, ζˆsg,j = 0.022, 0.025, 0.019 for all j = 1, . . . , 3, and τˆsg,j = 0.051, 0.055, 0.056, κˆ sg,j = 0.076, 0.075, 0.075, respectively, for j = 1, . . . , 3. So the habit function in the adaptive elastic net GMM with KT = 3 is estimated as zero (on the boundary), but with the sieve GMM, the estimate of the habit function is positive. Chen and Ludvigson (2009), with a larger instrument set than we used for their case, find the time parameter estimate to be 0.99 and curvature to be 0.76, and the habit function was positive at KT = 3. Note that we use time series data, and as we suggest after our theorems, this is plausible given our technique and the structure of the proofs. We now discuss how our assumptions fit this application. First, all of our parameters are uniformly bounded in this ap-


plication, which is discussed by Chen and Ludvigson (2009, p. 1067). The second issue is whether the uniform convergence of the partial derivatives is plausible; in the iid case this is satisfied through condition 3.5M of Chen (2007), which amounts to uniformly bounded partial derivatives, Lipschitz continuity of the partial derivatives, and log covering numbers growing more slowly than the rate T. Assumption 2 relates to convergence of the weight matrix, which is not restrictive; it implies a relation between q and T, so q cannot grow too fast. In our case q = N J_T with N fixed, so this restricts the growth of J_T. Assumption 3(i) is also similar to Assumption 2. For Assumption 3(ii), we have p = 3(K_T + 1); since q = N J_T ≥ 3(K_T + 1) = p and N J_T / T → 0, the assumption is satisfied. Assumption 3(iii) is satisfied here by imposing a value between 0 and 1 (including one) for the exponent in the elastic net based weights. Assumption 4 concerns the sample moment functions: since all of our variables are stationary, in terms of ratios, and bounded variables such as returns, we do not expect their 2 + l moments to grow faster than T^{1/2}. Assumption 5 concerns the penalty parameters and is satisfied with T in place of n.

6. CONCLUSION

In this article we analyze the adaptive elastic net GMM estimator, which simultaneously selects the model and estimates it. The new estimator allows for a diverging number of parameters. The estimator is shown to have the oracle property, so we can estimate the nonzero parameters with the standard GMM limit while the redundant parameters are set to zero by a data-dependent technique. Commonly used AIC and BIC methods, as well as our estimator, face certain nonuniform consistency issues. If we first select the model with AIC or BIC and then use GMM, this also suffers from the nonuniform consistency issues and, in addition, performs much worse than the adaptive elastic net estimator considered here. The issue with model selection (including AIC and BIC) is that all such procedures are uniformly inconsistent from a model selection perspective: arbitrarily local to zero coefficients cannot be selected as nonzero. Leeb and Pötscher (2005) established that, to get selected, the order of magnitude of local to zero coefficients should be larger than n^{−1/2}; between 0 and the order n^{−1/2}, model selection is indeterminate. We study the selection of local to zero coefficients in GMM and extend the results of Leeb and Pötscher (2005) to the GMM with a diverging number of parameters.

APPENDIX

Proof of Theorem 1(i). Huang, Horowitz, and Ma (2008) analyzed least squares with a Bridge penalty and a diverging number of parameters. Here we extend that to a diverging number of moment conditions and parameters with the adaptive elastic net penalty in nonlinear GMM. Starting with the definition of \(\hat{\beta}_{enet}\),

\[ (1 + \lambda_2/n) S_n(\hat{\beta}_{enet}) \le (1 + \lambda_2/n) S_n(\beta_0), \]


which is  n 

 gi (βˆenet )

Wn

 n 

i=1

 gi (βˆenet ) + λ1

+ λ2 βˆenet 22 ≤

n 

 gi (β0 )

Wn

 n 

i=1 p 

|βˆj,enet |

j =1

i=1



+ λ1

p 



gi (β0 )

i=1

|βj 0 | + λ2 β0 22 .

(A.1)

j =1

p  Then setting ιn = λ1 j =1 |βj 0 | + λ2 βj 0 22 = λ1 j ∈A  |βj 0 | + λ2 j ∈A βj20 , and by (A.1)  n    n   ιn ≥ gi (βˆenet ) Wn gi (βˆenet )


i=1





n 

i=1

 gi (β0 )

Wn

i=1

 n 



n 

(A.2)  gi (β0 )

i=1

ˆ n (β)(β ˆ n (β)(β ¯ 0 − βˆenet )] Wn [G ¯ 0 − βˆenet )], + [G (A.3) via the mean value theorem and β¯ ∈ (β0 , βˆenet). We now try n ¯ G (β) 1/2 to simplify (A.3). In this way, set n = nWn [ i=1n i ](β0 − n g (β ) 1/2  ¯ = ∂gi (β)/∂β ¯ βˆenet ) and Dn = n1/2 Wn [ i=1n1/2i 0 ] and Gi (β) , which is a q × p matrix. The next two equations ((A.4)– (A.6)) are from the article by Huang, Horowitz, and Ma (2008, p. 603). We use them to illustrate our point. Then it is clear that (A.3) can be written as −

Next, %    n %% n % % i=1 gi (β0 ) i=1 gi (β0 ) % −1 E%  % % % n1/2 n1/2       n   n g (β ) g (β ) i 0 i 0 i=1 i=1 = tr E −1 . n1/2 n1/2 = tr(Iq ) = q,

− ιn ≤ 0,

(A.4)

which provides us n − Dn 22 − Dn 22 − ιn ≤ 0.

Using (A.8) in (A.7) with the definition of Dn , with probability approaching 1,

× βˆenet − β0 22 ≥ n2 bβˆenet − β0 22 ,

(A.10)

with probability approaching 1 and seeing (6) with remembering that elastic net is a subcase of βˆw . Next use (A.9) and (A.10) in (A.6) to have n2 bEβˆenet − β0 22 ≤ O(nq) + O(λ1 pA + λ2 pA ), (A.11)   by seeing that λ1 j ∈A |βj 0 | + λ2 j ∈A βj20 = O(λ1 pA + λ2 pA ), given that nonzero parameters can be all constants at most. So by Assumptions 3 and 5, since λ1 /n → 0, λ2 /n → 0, pA /n → 0, q/n → 0, we have P

βˆenet − β0 22 → 0.

n − Dn 2 ≤ Dn 2 + ι1/2 n ,



and by triangle inequality ι1/2 n .

(A.5)

By using (A.4) and (A.5) with simple algebra, we have n 22 ≤ 6Dn 22 + 3ιn .

(A.9)

Next by the definition of n  and Assumption 2, we have    n ¯  i=1 Gi (β) 2 2  n 2 = n (β0 − βˆenet ) n   n & ¯ & i=1 Gi (β) (β0 − βˆenet ) × −1 n     n n ¯  ¯  G ( β) G ( β) i i i=1 i=1 ≥ n2 Eigmin −1 n n

Using the last inequality, we deduct

n 2 ≤ n − Dn 2 + Dn 2 ≤ 2Dn 2 +

(A.8)

EDn2 22 = O(nq).

gi (β0 )

i=1

2Dn n

with probability approaching 1, by substituting W = −1 and by Assumptions 2 and 3. Then by the definition of Dn and (A.4) %    n %% n % % i=1 gi (β0 ) i=1 gi (β0 ) % 2 −1 EDn 2 − nE %  % = o(1). % % n1/2 n1/2



ˆ n (β)(β ¯ 0 − βˆenet )] Wn = −2[G

n n


(A.6)

Note

%    n %% n % % i=1 gi (β0 ) i=1 gi (β0 ) % Wn nE % % % % n1/2 n1/2 %    n %% n % g (β ) g (β ) i 0 i 0 % % i=1 i=1 −nE % −1 % = o(1), % % n1/2 n1/2 (A.7)

Proof of Theorem 1(ii). The proof for the consistency of the estimator βˆw is the same as in the elastic net case in Theorem p 1(i). The only pdifference is the definition of ιn = λ1 j =1 wˆ j |βj 0 | + λ2 j =1 βj20 . Note that when βj 0 = 0, we   can write ιn = λ1 j ∈A wˆ j |βj 0 | + λ2 j ∈A βj20 . In other words, only nonzero coefficient weights play a role in term ιn . As in (A.11) n2 bEβˆw − β0 22



= O(nq) + O ⎝λ1

 j ∈A

wˆ j |βj 0 | + λ2



⎞ βj20 ⎠ .

(A.12)

j ∈A

The key issue is the order of the penalty terms. We are only interested in nonzero parameter weights. Defining



since λ1 /n → 0 by Assumption 5(i), and by Assumption 5(iv) n1−α ηγ → ∞. The last two equations illustrate that (A.14) &  −γ  ' ηˆ λ1 η−γ pA C λ1 ηˆ −γ pA C P = → 0. (A.19) n2 n2 η

ηˆ = minj ∈A |βˆj,enet |, we have for j ∈ A, wˆ j = Then λ1



1 1 ≤ γ. γ ˆ η ˆ |βj,enet |

ˆ j |βj 0 | j ∈A w 2 n



λ1 ηˆ −γ pA C , n2

(A.13)


where we use maxj ∈A |βj 0 | ≤ C, where C is a generic positive constant. So we can write (A.13) &  −γ  ' ηˆ λ1 η−γ pA C λ1 ηˆ −γ pA C = . (A.14) 2 2 n n η In the above equation we first show  2 ηˆ = O(1). E η

2 Eβˆenet − β0 22 . η2

(A.15)

(A.16)

Next by Theorem 1(i) (Equation (A.11))   1 ˆenet − β0 22 = O qn + λ1 pA + λ2 pA , E β η2 n2 η2 since λ1 /n → 0, λ2 /n → 0, and pA ≤ q, the largest stochastic order is qn/n2 η2 = q/nη2 . By q = nν , and 0 < ν < 1 with n1−ν η2 → ∞ by Assumption 5(iii) clearly 1 Eβˆenet − β0 22 = o(1). η2

(A.17)

Next by (A.16) and (A.17) we have  2 ηˆ ≤ 2 + o(1). E η

Next on the right-hand side of (A.14) λ 1 pA C → 0, n nηγ

λ 2 pA n2

 = o(1),

(A.20)

by λ2 /n → 0, pA /n → 0. Also note that by q/n → 0 through Assumption 3 combining (A.19) and (A.20) in (A.12) above provides us

= O

( nq ) n2

 +O

λ1

 j ∈A

wˆ j |βj 0 | + λ2

 j ∈A

βj20



n2

= o(1).

(A.21) 
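A small computational aside: once a first-stage elastic-net estimate is available, the adaptive weights $\hat w_j=|\hat\beta_{j,\mathrm{enet}}|^{-\gamma}$ that enter $\iota_n$ and the bound (A.13) are immediate to form. The sketch below is hypothetical; the exponent gamma, the numerical floor eps, and the first-stage vector are user-supplied assumptions rather than quantities fixed by the article.

```python
import numpy as np

def adaptive_weights(beta_enet, gamma=1.0, eps=1e-8):
    """w_hat_j = |beta_hat_{j,enet}|^(-gamma).  Coefficients estimated as
    (numerically) zero receive a very large weight, so the second-stage
    penalty pushes them toward zero.  The eps floor is only an
    implementation convenience, not part of the theory."""
    return 1.0 / np.maximum(np.abs(beta_enet), eps) ** gamma

# Hypothetical first-stage estimate with two large and four small coefficients.
beta_enet = np.array([1.42, -0.95, 0.02, 0.0, 0.01, 0.0])
w = adaptive_weights(beta_enet, gamma=1.0)

# Weighted l1 penalty entering the adaptive objective; as in the proof,
# only the weights of the (truly) nonzero coefficients matter for iota_n.
lam1_star = 2.0
print(w, lam1_star * np.sum(w * np.abs(beta_enet)))
```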

Proof of Theorem 2. In this proof we start by analyzing the GMM-Ridge estimator, defined as
\[
\hat\beta_R=\arg\min_{\beta}\Big\{\Big[\sum_{i=1}^{n}g_i(\beta)\Big]'W_n\Big[\sum_{i=1}^{n}g_i(\beta)\Big]+\lambda_2\|\beta\|_2^2\Big\}.
\]
Note that this estimator is the elastic net estimator with $\lambda_1=0$ imposed; so, since the elastic net estimator is consistent, GMM-Ridge is consistent as well. Define also the $q\times p$ matrix $\hat G_n(\hat\beta_R)=\sum_{i=1}^{n}\partial g_i(\hat\beta_R)/\partial\beta'$, and set $\bar\beta\in(\beta_0,\hat\beta_R)$, where $\hat G_n(\bar\beta)$ is $\hat G_n(\cdot)$ evaluated at $\bar\beta$. Applying the mean value theorem around $\beta_0$ to the first-order conditions provides, with $g_n(\beta_0)=\sum_{i=1}^{n}g_i(\beta_0)$,
\[
\hat\beta_R=-\big[\hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]^{-1}
\big[\hat G_n(\hat\beta_R)'W_n g_n(\beta_0)-\hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)\beta_0\big]. \tag{A.22}
\]
Also, using the mean value theorem with the first-order conditions and adding and subtracting $\lambda_2\beta_0$ yields
\[
\hat\beta_R-\beta_0=-\big[\hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]^{-1}
\big[\hat G_n(\hat\beta_R)'W_n g_n(\beta_0)+\lambda_2\beta_0\big]. \tag{A.23}
\]
We need the following expressions, obtained by using (A.22):
\[
\hat\beta_R'\big[\hat G_n(\hat\beta_R)'W_n g_n(\beta_0)-\hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)\beta_0\big]
=-\big[\hat G_n(\hat\beta_R)'W_n g_n(\beta_0)-\hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)\beta_0\big]'
\big[\hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]^{-1}
\big[\hat G_n(\hat\beta_R)'W_n g_n(\beta_0)-\hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)\beta_0\big], \tag{A.24}
\]
\[
\hat\beta_R'\big[\hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]\hat\beta_R
=\big[\hat G_n(\hat\beta_R)'W_n g_n(\beta_0)-\hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)\beta_0\big]'
\big[\hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]^{-1}
\big[\hat G_n(\hat\beta_R)'W_n g_n(\beta_0)-\hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)\beta_0\big]. \tag{A.25}
\]


Next the aim is to rewrite the GMM-Ridge objective function via a mean value expansion:
\[
\Big[\sum_{i=1}^{n}g_i(\hat\beta_R)\Big]'W_n\Big[\sum_{i=1}^{n}g_i(\hat\beta_R)\Big]+\lambda_2\|\hat\beta_R\|_2^2
= g_n(\beta_0)'W_n g_n(\beta_0)+g_n(\beta_0)'W_n\hat G_n(\bar\beta)(\hat\beta_R-\beta_0)
+(\hat\beta_R-\beta_0)'\hat G_n(\bar\beta)'W_n g_n(\beta_0)
+(\hat\beta_R-\beta_0)'\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)(\hat\beta_R-\beta_0)+\lambda_2\|\hat\beta_R\|_2^2. \tag{A.26}
\]
When $\lambda_1=0$, using Theorem 1 we see that $\|\hat\beta_R-\bar\beta\|_2^2\xrightarrow{P}0$. Then use Assumption 1 to have
\[
\hat\beta_R'\Big[\Big(\frac{\hat G_n(\hat\beta_R)}{n}\Big)'W_n\Big(\frac{\hat G_n(\bar\beta)}{n}\Big)+\lambda_2 I_p\Big]\hat\beta_R
=\hat\beta_R'\Big[\Big(\frac{\hat G_n(\bar\beta)}{n}\Big)'W_n\Big(\frac{\hat G_n(\bar\beta)}{n}\Big)+\lambda_2 I_p\Big]\hat\beta_R
+\hat\beta_R'\,o_P(1)\,W_n\Big(\frac{\hat G_n(\bar\beta)}{n}\Big)\hat\beta_R, \tag{A.27}
\]
where the $o_P(1)$ term comes from the uniform law of large numbers. Clearly the stochastic order of the second term on the right-hand side of (A.27) is smaller than that of the first. By the same argument,
\[
\hat\beta_R'\Big[\Big(\frac{\hat G_n(\hat\beta_R)}{n}\Big)'W_n g_n(\beta_0)-\Big(\frac{\hat G_n(\hat\beta_R)}{n}\Big)'W_n\Big(\frac{\hat G_n(\bar\beta)}{n}\Big)\beta_0\Big]
=\hat\beta_R'\Big[\Big(\frac{\hat G_n(\bar\beta)}{n}\Big)'W_n g_n(\beta_0)-\Big(\frac{\hat G_n(\bar\beta)}{n}\Big)'W_n\Big(\frac{\hat G_n(\bar\beta)}{n}\Big)\beta_0\Big]
+\hat\beta_R'\Big[o_P(1)W_n g_n(\beta_0)-o_P(1)W_n\Big(\frac{\hat G_n(\bar\beta)}{n}\Big)\beta_0\Big]. \tag{A.28}
\]
Again the second term's stochastic order is smaller than the first one in (A.28). Furthermore, expanding the right-hand side of (A.26) and applying (A.24) and (A.25) (note that the right-hand side of (A.24) is just the negative of that of (A.25)), via Assumption 1 and the consistency of GMM-Ridge (ridge is a subcase of the elastic net with $\lambda_1=0$ imposed on the elastic net), we obtain
\[
\Big[\sum_{i=1}^{n}g_i(\hat\beta_R)\Big]'W_n\Big[\sum_{i=1}^{n}g_i(\hat\beta_R)\Big]+\lambda_2\|\hat\beta_R\|_2^2
= g_n(\beta_0)'W_n g_n(\beta_0)
-\hat\beta_R'\big[\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]\hat\beta_R
-\beta_0'\hat G_n(\bar\beta)'W_n g_n(\beta_0)-g_n(\beta_0)'W_n\hat G_n(\bar\beta)\beta_0
+\beta_0'\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)\beta_0
-\hat\beta_R'\,so_1\,\hat\beta_R, \tag{A.29}
\]
where $so_1$ represents the small-order terms mentioned in (A.27) and (A.28). As in (A.29), for the estimator $\hat\beta_w$ we have the following, where $\bar\beta_w\in(\beta_0,\hat\beta_w)$:
\[
\Big[\sum_{i=1}^{n}g_i(\hat\beta_w)\Big]'W_n\Big[\sum_{i=1}^{n}g_i(\hat\beta_w)\Big]+\lambda_2\|\hat\beta_w\|_2^2
= g_n(\beta_0)'W_n g_n(\beta_0)
+\hat\beta_w'\big[\hat G_n(\bar\beta_w)'W_n g_n(\beta_0)-\hat G_n(\bar\beta_w)'W_n\hat G_n(\bar\beta_w)\beta_0\big]
+\big[\hat G_n(\bar\beta_w)'W_n g_n(\beta_0)-\hat G_n(\bar\beta_w)'W_n\hat G_n(\bar\beta_w)\beta_0\big]'\hat\beta_w
+\hat\beta_w'\big[\hat G_n(\bar\beta_w)'W_n\hat G_n(\bar\beta_w)+\lambda_2 I_p\big]\hat\beta_w
-\beta_0'\hat G_n(\bar\beta_w)'W_n g_n(\beta_0)-g_n(\beta_0)'W_n\hat G_n(\bar\beta_w)\beta_0
+\beta_0'\hat G_n(\bar\beta_w)'W_n\hat G_n(\bar\beta_w)\beta_0. \tag{A.30}
\]
Then see that by Theorem 1, $\|\bar\beta-\bar\beta_w\|_2^2\xrightarrow{P}0$, and using (A.22),
\[
\hat\beta_w'\big[\hat G_n(\bar\beta)'W_n g_n(\beta_0)-\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)\beta_0\big]
=\hat\beta_w'\big[\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]
\big[\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]^{-1}
\big[\hat G_n(\bar\beta)'W_n g_n(\beta_0)-\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)\beta_0\big]
=-\hat\beta_w'\big[\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]\hat\beta_R-\hat\beta_w'\,so_2\,\hat\beta_R. \tag{A.31}
\]
The term $so_2$ above comes from the same type of analysis done for the second, small stochastic order terms in (A.27) and (A.28). Next substitute (A.31) into (A.30) to have
\[
\Big[\sum_{i=1}^{n}g_i(\hat\beta_w)\Big]'W_n\Big[\sum_{i=1}^{n}g_i(\hat\beta_w)\Big]+\lambda_2\|\hat\beta_w\|_2^2
= g_n(\beta_0)'W_n g_n(\beta_0)
-\hat\beta_w'\big[\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]\hat\beta_R
-\hat\beta_R'\big[\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]\hat\beta_w
+\hat\beta_w'\big[\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]\hat\beta_w
-g_n(\beta_0)'W_n\hat G_n(\bar\beta)\beta_0-\beta_0'\hat G_n(\bar\beta)'W_n g_n(\beta_0)
+\beta_0'\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)\beta_0
-\hat\beta_w'\,so_2\,\hat\beta_R-\hat\beta_R'\,so_3\,\hat\beta_w+\hat\beta_w'\,so_4\,\hat\beta_w. \tag{A.32}
\]
Term $so_3$ is the transpose of $so_2$, and term $so_4$ comes from the approximation error between $\bar\beta$ and $\bar\beta_w$. Specifically, $\hat\beta_w'so_2\hat\beta_R$, $\hat\beta_R'so_3\hat\beta_w$, and $\hat\beta_w'so_4\hat\beta_w$ are of smaller order than the second, third, and fourth terms on the right-hand side of (A.32), respectively. Denote $so_5=\min(so_1,so_2,so_3,so_4)$. Now subtract (A.29) from (A.32) to have
\[
\Big[\sum_{i=1}^{n}g_i(\hat\beta_w)\Big]'W_n\Big[\sum_{i=1}^{n}g_i(\hat\beta_w)\Big]+\lambda_2\|\hat\beta_w\|_2^2
-\Big[\sum_{i=1}^{n}g_i(\hat\beta_R)\Big]'W_n\Big[\sum_{i=1}^{n}g_i(\hat\beta_R)\Big]-\lambda_2\|\hat\beta_R\|_2^2
\ \ge\ (\hat\beta_w-\hat\beta_R)'\big[\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big](\hat\beta_w-\hat\beta_R)
+(\hat\beta_w-\hat\beta_R)'\,so_5\,(\hat\beta_w-\hat\beta_R). \tag{A.33}
\]
The next analysis is very similar to equations (6.1)-(6.6) of Zou and Zhang (2009). After this important result, see that by


the definitions of $\hat\beta_w$ and $\hat\beta_R$,
\[
\lambda_1\sum_{j=1}^{p}\hat w_j\big(|\hat\beta_{j,R}|-|\hat\beta_{j,w}|\big)
\ \ge\ \Big[\sum_{i=1}^{n}g_i(\hat\beta_w)\Big]'W_n\Big[\sum_{i=1}^{n}g_i(\hat\beta_w)\Big]+\lambda_2\|\hat\beta_w\|_2^2
-\Big[\sum_{i=1}^{n}g_i(\hat\beta_R)\Big]'W_n\Big[\sum_{i=1}^{n}g_i(\hat\beta_R)\Big]-\lambda_2\|\hat\beta_R\|_2^2. \tag{A.34}
\]
Then also see that
\[
\sum_{j=1}^{p}\hat w_j\big(|\hat\beta_{j,R}|-|\hat\beta_{j,w}|\big)\le\sqrt{\sum_{j=1}^{p}\hat w_j^2}\;\|\hat\beta_R-\hat\beta_w\|_2. \tag{A.35}
\]
Next benefit from (A.34), together with (A.33) and (A.35), to have
\[
(\hat\beta_w-\hat\beta_R)'\big[\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big](\hat\beta_w-\hat\beta_R)
+(\hat\beta_w-\hat\beta_R)'\,so_5\,(\hat\beta_w-\hat\beta_R)
\le\lambda_1\sqrt{\sum_{j=1}^{p}\hat w_j^2}\;\|\hat\beta_R-\hat\beta_w\|_2. \tag{A.36}
\]
See that $\mathrm{Eigmin}(\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p)=\mathrm{Eigmin}(\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta))+\lambda_2$. Use this in (A.36) to have
\[
\|\hat\beta_w-\hat\beta_R\|_2\le\frac{\lambda_1\sqrt{\sum_{j=1}^{p}\hat w_j^2}}{\mathrm{Eigmin}(\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta))+\lambda_2+so_5}, \tag{A.37}
\]
where $\mathrm{Eigmin}(\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta))$ is a term of larger stochastic order than $so_5$, as explained in (A.29) and (A.32). We also want to modify the last inequality. By the consistency of $\hat\beta_w$ and $\hat\beta_R$, $\bar\beta\xrightarrow{P}\beta_0$. Then, with the uniform law of large numbers on the partial derivative, we have by Assumptions 1-3
\[
\Big\|\Big(\frac{\hat G_n(\bar\beta)}{n}\Big)'W_n\Big(\frac{\hat G_n(\bar\beta)}{n}\Big)-G(\beta_0)'W G(\beta_0)\Big\|_2^2\ \xrightarrow{P}\ 0.
\]
The last equation is also true with $\hat\beta_w$ or $\hat\beta_R$ replacing $\bar\beta$. Then
\[
\big\|\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)-n^2\,G(\beta_0)'W G(\beta_0)-o(n^2)\big\|_2^2\ \xrightarrow{P}\ 0. \tag{A.38}
\]
In the same way we obtain
\[
\big\|\hat G_n(\hat\beta_R)'W_n g_n(\beta_0)-n\,G(\beta_0)'W g_n(\beta_0)-o_P(n)\big\|_2^2\ \xrightarrow{P}\ 0.
\]
Using Lemma A0 of Newey and Windmeijer (2009b), modify (A.37) given the last equalities, setting $W=\Omega^{-1}$ (the efficient limit weight, as shown by Hansen (1982)):
\[
\|\hat\beta_w-\hat\beta_R\|_2\le\frac{\lambda_1\sqrt{\sum_{j=1}^{p}\hat w_j^2}}{n^2\,\mathrm{Eigmin}(G(\beta_0)'\Omega^{-1}G(\beta_0))+\lambda_2+o_P(n^2)}. \tag{A.39}
\]
Now we consider the second part of the proof of this theorem. We use the GMM-Ridge formula. Note that from (A.23),
\[
\hat\beta_R-\beta_0=-\lambda_2\big[\hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]^{-1}\beta_0
-\big[\hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]^{-1}\big[\hat G_n(\hat\beta_R)'W_n g_n(\beta_0)\big]. \tag{A.40}
\]
We try to modify the above equation a little. First, see that since the $g_i(\beta)$ are independent,
\[
\big\|E[g_n(\beta_0)g_n(\beta_0)']-n\Omega\big\|\to 0. \tag{A.41}
\]
Second,
\[
E\big[g_n(\beta_0)'\Omega^{-1}G(\beta_0)G(\beta_0)'\Omega^{-1}g_n(\beta_0)\big]
=\mathrm{tr}\big\{G(\beta_0)'\Omega^{-1}E[g_n(\beta_0)g_n(\beta_0)']\,\Omega^{-1}G(\beta_0)\big\}
=n\,\mathrm{tr}\big\{G(\beta_0)'\Omega^{-1}G(\beta_0)\big\}+o(1)
\le np\,\mathrm{Eigmax}(\Sigma)+o(1), \tag{A.42}
\]
where we use $\Sigma=G(\beta_0)'\Omega^{-1}G(\beta_0)$. Now we modify (A.40) using (A.41) and (A.38), with $W=\Omega^{-1}$:
\[
\hat\beta_R-\beta_0=-\lambda_2\big[n^2 G(\beta_0)'\Omega^{-1}G(\beta_0)+\lambda_2 I_p+o_P(n^2)\big]^{-1}\beta_0
-\big[n^2 G(\beta_0)'\Omega^{-1}G(\beta_0)+\lambda_2 I_p+o_P(n^2)\big]^{-1}\big[n\,G(\beta_0)'\Omega^{-1}g_n(\beta_0)+o_P(n)\big]. \tag{A.43}
\]
Then see that
\[
E\|\hat\beta_R-\beta_0\|_2^2
\le 2\lambda_2^2\big[n^2\,\mathrm{Eigmin}(\Sigma)+\lambda_2+o(n^2)\big]^{-2}\|\beta_0\|_2^2
+2\big[n^2\,\mathrm{Eigmin}(\Sigma)+\lambda_2+o(n^2)\big]^{-2}n^2 E\big[g_n(\beta_0)'\Omega^{-1}G(\beta_0)G(\beta_0)'\Omega^{-1}g_n(\beta_0)\big]+o(n^2)
\le 2\big[n^2\,\mathrm{Eigmin}(\Sigma)+\lambda_2+o(n^2)\big]^{-2}\big[\lambda_2^2\|\beta_0\|_2^2+n^3 p\,\mathrm{Eigmax}(\Sigma)+o(n^2)\big], \tag{A.44}
\]
where the last inequality is by (A.42). Note that we do not keep track of orders smaller than $o(n^2)$ in (A.44), since this makes no difference in the proofs of the theorems below. Now use (A.39) and (A.44) to have
\[
E\|\hat\beta_w-\beta_0\|_2^2\le 2E\|\hat\beta_R-\beta_0\|_2^2+2E\|\hat\beta_w-\hat\beta_R\|_2^2
\le\frac{4\lambda_2^2\|\beta_0\|_2^2+4n^3 pB+o(n^2)+2\lambda_1^2 E\big(\sum_{j=1}^{p}\hat w_j^2\big)}{[n^2 b+\lambda_2+o(n^2)]^2}. \tag{A.45}
\]
See that $b\le\mathrm{Eigmin}(G(\beta_0)'\Omega^{-1}G(\beta_0))=\mathrm{Eigmin}(\Sigma)$ and $B\ge\mathrm{Eigmax}(\Sigma)$.
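To see the GMM-Ridge construction used in this proof in a concrete and deliberately simplified setting, consider linear instrumental-variable moments $g_i(\beta)=z_i(y_i-x_i'\beta)$. There the first-order condition behind (A.22)-(A.23) is exactly linear in $\beta$, so the ridge estimator has a closed form. The moment function, the identity weight matrix, and the variable names below are illustrative assumptions, not the article's general nonlinear estimator.

```python
import numpy as np

def gmm_ridge_linear_iv(y, X, Z, lam2, W=None):
    """GMM-Ridge for linear IV moments g_n(b) = Z'(y - X b):
    minimizes g_n(b)' W g_n(b) + lam2 * ||b||_2^2.  With linear moments
    the first-order condition is linear in b, so the minimizer solves
    (X'Z W Z'X + lam2 I) b = X'Z W Z'y.  This mirrors the ridge-type
    system in (A.22)-(A.23) only for this special case."""
    q = Z.shape[1]
    if W is None:
        W = np.eye(q)
    A = X.T @ Z @ W @ Z.T @ X          # p x p "G_n' W G_n" matrix
    c = X.T @ Z @ W @ Z.T @ y          # right-hand side of the normal equations
    p = X.shape[1]
    return np.linalg.solve(A + lam2 * np.eye(p), c)

# Usage with the simulated y, X, Z from the earlier sketch (names assumed):
# beta_ridge = gmm_ridge_linear_iv(y, X, Z, lam2=1.0)
```

In the general nonlinear case no closed form exists, which is why the proof works instead with the mean-value linearization in (A.22) and (A.23).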

Proof of Theorem 3(i). The proof is similar to the proof of theorem 3.2 of Zou and Zhang (2009); the differences are due to the nonlinear nature of the problem, and our upper bounds in Theorem 2 converge to zero at a different rate than those of Zou and Zhang (2009). To prove the theorem, we need to show the following (note the Kuhn-Tucker conditions of (1)):
\[
P\Big[\forall j\in\mathcal{A}^c:\ \Big|2\hat G_{n,j}(\tilde\beta)'W_n\sum_{i=1}^{n}g_i(\tilde\beta)\Big|\le\lambda_1^{*}\hat w_j\Big]\to 1,
\]
where $\hat G_n(\tilde\beta)=\sum_{i=1}^{n}\partial g_i(\tilde\beta)/\partial\beta'$, $\mathcal{A}^c=\{j:\beta_{j0}=0,\ j=1,2,\ldots,p\}$, and $\hat G_{n,j}(\tilde\beta)$ denotes the $j$th column of the partial derivative matrix (the columns corresponding to the irrelevant parameters), evaluated at $\tilde\beta$.


Equivalently, we need to show
\[
P\Big[\exists j\in\mathcal{A}^c:\ \Big|2\hat G_{n,j}(\tilde\beta)'W_n\sum_{i=1}^{n}g_i(\tilde\beta)\Big|>\lambda_1^{*}\hat w_j\Big]\to 0.
\]
Now set $\eta=\min_{j\in\mathcal{A}}|\beta_{j0}|$, $\hat\eta=\min_{j\in\mathcal{A}}|\hat\beta_{j,\mathrm{enet}}|$, and $\mathcal{A}=\{j:\beta_{j0}\ne 0,\ j=1,2,\ldots,p\}$. So
\[
P\Big[\exists j\in\mathcal{A}^c:\ \Big|2\hat G_{n,j}(\tilde\beta)'W_n\sum_{i=1}^{n}g_i(\tilde\beta)\Big|>\lambda_1^{*}\hat w_j\Big]
\le\sum_{j\in\mathcal{A}^c}P\Big[\Big|2\hat G_{n,j}(\tilde\beta)'W_n\sum_{i=1}^{n}g_i(\tilde\beta)\Big|>\lambda_1^{*}\hat w_j,\ \hat\eta>\eta/2\Big]
+P[\hat\eta\le\eta/2]. \tag{A.46}
\]
Then, as in equation (6.7) of Zou and Zhang (2009), we can show that
\[
P[\hat\eta\le\eta/2]\le\frac{E\|\hat\beta_{\mathrm{enet}}-\beta_0\|_2^2}{\eta^2/4}
\le 16\,\frac{\lambda_2^2\|\beta_0\|_2^2+n^3 pB+\lambda_1^2 p+o(n^2)}{[n^2 b+\lambda_2+o(n^2)]^2\,\eta^2}, \tag{A.47}
\]
where the second inequality is due to Theorem 2. Then we can also have
\[
\sum_{j\in\mathcal{A}^c}P\Big[\Big|2\hat G_{n,j}(\tilde\beta)'W_n\sum_{i=1}^{n}g_i(\tilde\beta)\Big|>\lambda_1^{*}\hat w_j,\ \hat\eta>\eta/2\Big]
\le\sum_{j\in\mathcal{A}^c}P\Big[\Big|2\hat G_{n,j}(\tilde\beta)'W_n\sum_{i=1}^{n}g_i(\tilde\beta)\Big|>\lambda_1^{*}\hat w_j,\ \hat\eta>\eta/2,\ |\hat\beta_{j,\mathrm{enet}}|\le M\Big]
+\sum_{j\in\mathcal{A}^c}P\big[|\hat\beta_{j,\mathrm{enet}}|>M\big], \tag{A.48}
\]
where $M=(\lambda_1^{*}/n^{3+\alpha})^{1/2\gamma}$. Compared to Zou and Zhang (2009), $M$ converges to zero faster. In (A.48) we consider the second term on the right-hand side. Via inequality (6.8) of Zou and Zhang (2009) and Theorem 2, we have
\[
\sum_{j\in\mathcal{A}^c}P\big[|\hat\beta_{j,\mathrm{enet}}|>M\big]\le\frac{E\|\hat\beta_{\mathrm{enet}}-\beta_0\|_2^2}{M^2}
\le 4\,\frac{\lambda_2^2\|\beta_0\|_2^2+n^3 pB+\lambda_1^2 p+o(n^2)}{[n^2 b+\lambda_2+o(n^2)]^2 M^2}. \tag{A.49}
\]
Next we consider the first term on the right-hand side of (A.48). On that event, $j\in\mathcal{A}^c$ and $|\hat\beta_{j,\mathrm{enet}}|\le M$ imply $\hat w_j\ge M^{-\gamma}$, so by Markov's inequality
\[
\sum_{j\in\mathcal{A}^c}P\Big[\Big|2\hat G_{n,j}(\tilde\beta)'W_n\sum_{i=1}^{n}g_i(\tilde\beta)\Big|>\lambda_1^{*}M^{-\gamma},\ \hat\eta>\eta/2\Big]
\le\frac{4M^{2\gamma}}{\lambda_1^{*2}}\,E\Big[\sum_{j\in\mathcal{A}^c}\Big|\hat G_{n,j}(\tilde\beta)'W_n\sum_{i=1}^{n}g_i(\tilde\beta)\Big|^2 I_{\{\hat\eta>\eta/2\}}\Big]. \tag{A.50}
\]
So we try to simplify the term on the right-hand side of (A.50). Now we evaluate
\[
\sum_{j\in\mathcal{A}^c}\Big|\hat G_{n,j}(\tilde\beta)'W_n\sum_{i=1}^{n}g_i(\tilde\beta)\Big|^2
\le 2\sum_{j\in\mathcal{A}^c}\Big|\hat G_{n,j}(\tilde\beta)'W_n\sum_{i=1}^{n}g_i(\beta_{\mathcal{A},0})\Big|^2
+2\sum_{j\in\mathcal{A}^c}\big|\hat G_{n,j}(\tilde\beta)'W_n\hat G_n(\bar\beta)(\tilde\beta-\beta_{\mathcal{A},0})\big|^2, \tag{A.51}
\]
where $\bar\beta\in(\beta_{\mathcal{A},0},\tilde\beta)$ and
\[
g_i(\tilde\beta)=g_i(\beta_{\mathcal{A},0})+\Big[\frac{\partial g_i(\bar\beta)}{\partial\beta'}\Big](\tilde\beta-\beta_{\mathcal{A},0}).
\]
We analyze each term in (A.51). Note that $\tilde\beta$ is consistent if we go through the same steps as in Theorem 1, using Assumption 5 on $\lambda_1^{*}$. Then, applying Assumption 1 with Assumption 2 (the uniform law of large numbers) in the same way as in equation (6.9) of Zou and Zhang (2009), we have
\[
\sum_{j\in\mathcal{A}^c}\Big|\hat G_{n,j}(\tilde\beta)'W_n\sum_{i=1}^{n}g_i(\beta_{\mathcal{A},0})\Big|^2
\le 2n^2\Big\|G(\beta_{\mathcal{A},0})'\Omega_*^{-1}\sum_{i=1}^{n}g_i(\beta_{\mathcal{A},0})\Big\|_2^2+o_P(n^2), \tag{A.52}
\]
and, using Assumption 3(i) and putting $W_n=[n^{-1}\sum_{i=1}^{n}Eg_i(\beta_{\mathcal{A},0})g_i(\beta_{\mathcal{A},0})']^{-1}$ with $\|[n^{-1}\sum_{i=1}^{n}Eg_i(\beta_{\mathcal{A},0})g_i(\beta_{\mathcal{A},0})']^{-1}-\Omega_*^{-1}\|_2\to 0$ (the efficient limit weight matrix for the case of nonzero parameters),
\[
E\Big[\sum_{j\in\mathcal{A}^c}\Big|\hat G_{n,j}(\tilde\beta)'W_n\sum_{i=1}^{n}g_i(\beta_{\mathcal{A},0})\Big|^2\Big]\le n^3 B+o(n^3), \tag{A.53}
\]
where we use
\[
B\ge\mathrm{Eigmax}(\Sigma)\ge\mathrm{Eigmax}\big(G(\beta_{\mathcal{A},0})'\Omega_*^{-1}G(\beta_{\mathcal{A},0})\big). \tag{A.54}
\]
See that $\sum_{i=1}^{n}g_i(\beta_{\mathcal{A},0})=O_P(n^{1/2})$. In the same manner as in equation (6.9) of Zou and Zhang (2009), we have
\[
\sum_{j\in\mathcal{A}^c}\big|\hat G_{n,j}(\tilde\beta)'W_n\hat G_n(\bar\beta)(\tilde\beta-\beta_{\mathcal{A},0})\big|^2
\le n^4\big|G(\beta_{\mathcal{A},0})'\Omega_*^{-1}G(\beta_{\mathcal{A},0})(\tilde\beta-\beta_{\mathcal{A},0})\big|^2+o_P(n^4). \tag{A.55}
\]


Then, by (A.54) and taking into account (A.55), we have
\[
\sum_{j\in\mathcal{A}^c}\big|\hat G_{n,j}(\tilde\beta)'W_n\hat G_n(\bar\beta)(\tilde\beta-\beta_{\mathcal{A},0})\big|^2
\le n^4 B^2\|\tilde\beta-\beta_{\mathcal{A},0}\|_2^2+o_P(n^4). \tag{A.56}
\]
Now substituting (A.53)-(A.56) into the term on the right-hand side of (A.50), we get
\[
E\Big[\sum_{j\in\mathcal{A}^c}\Big|\hat G_{n,j}(\tilde\beta)'W_n\sum_{i=1}^{n}g_i(\tilde\beta)\Big|^2 I_{\{\hat\eta>\eta/2\}}\Big]
\le 2B^2 n^4\,E\big(\|\tilde\beta-\beta_{\mathcal{A},0}\|_2^2\,I_{\{\hat\eta>\eta/2\}}\big)+2Bn^3+o(n^4). \tag{A.57}
\]
Define the ridge-based version of $\tilde\beta$ by imposing $\lambda_1^{*}=0$:
\[
\tilde\beta(\lambda_2,0)=\arg\min_{\beta_{\mathcal{A}}}\Big\{\Big[\sum_{i=1}^{n}g_i(\beta_{\mathcal{A}})\Big]'W_n\Big[\sum_{i=1}^{n}g_i(\beta_{\mathcal{A}})\Big]+\lambda_2\sum_{j\in\mathcal{A}}\beta_j^2\Big\}. \tag{A.58}
\]
Then, using the arguments in the proof of Theorem 2 (Equation (A.39)), we have
\[
\|\tilde\beta-\tilde\beta(\lambda_2,0)\|_2
\le\frac{\lambda_1^{*}\max_{j\in\mathcal{A}}\hat w_j\sqrt{p}}{n^2\,\mathrm{Eigmin}(G(\beta_{\mathcal{A},0})'\Omega_*^{-1}G(\beta_{\mathcal{A},0}))+\lambda_2+o_P(n^2)}
\le\frac{\lambda_1^{*}\hat\eta^{-\gamma}\sqrt{p}}{n^2 b+\lambda_2+o_P(n^2)}, \tag{A.59}
\]
where $\mathrm{Eigmin}(G(\beta_{\mathcal{A},0})'\Omega_*^{-1}G(\beta_{\mathcal{A},0}))\ge\mathrm{Eigmin}(G(\beta_0)'\Omega_*^{-1}G(\beta_0))\ge b$. Then follow the proof of Theorem 2 (i.e., Equation (A.45)) for the right-hand-side term in (A.57):
\[
E\big(\|\tilde\beta-\beta_{\mathcal{A},0}\|_2^2\,I_{\{\hat\eta>\eta/2\}}\big)
\le 4\,\frac{\lambda_2^2\|\beta_0\|_2^2+n^3 pB+\lambda_1^{*2}(\eta/2)^{-2\gamma}p+o(n^2)}{(bn^2+\lambda_2+o(n^2))^2}. \tag{A.60}
\]
Now combine (A.47)-(A.50) with (A.57) and (A.60) in (A.46):
\[
P\Big[\exists j\in\mathcal{A}^c:\ \Big|2\hat G_{n,j}(\tilde\beta)'W_n\sum_{i=1}^{n}g_i(\tilde\beta)\Big|>\lambda_1^{*}\hat w_j\Big]
\le\frac{16}{\eta^2}\Big[\frac{\lambda_2^2\|\beta_0\|_2^2+Bn^3 p+\lambda_1^2 p+o(n^2)}{[n^2 b+\lambda_2+o(n^2)]^2}\Big]
+\frac{4}{M^2}\Big[\frac{\lambda_2^2\|\beta_0\|_2^2+Bn^3 p+\lambda_1^2 p+o(n^2)}{[n^2 b+\lambda_2+o(n^2)]^2}\Big]
+\frac{4M^{2\gamma}}{\lambda_1^{*2}}\Big[\big(2n^3 B+o(n^3)\big)+\big(2B^2 n^4+o(n^4)\big)\,\frac{4\big(\lambda_2^2\|\beta_0\|_2^2+n^3 pB+\lambda_1^{*2}(\eta/2)^{-2\gamma}p+o(n^2)\big)}{(bn^2+\lambda_2+o(n^2))^2}\Big]. \tag{A.61}
\]
Now we need to show that each square-bracketed term on the right-hand side of Equation (A.61) converges to zero. We start with the first square-bracketed term. After dividing by the squared denominator (of order $n^4$), the orders of its expressions are
\[
\frac{\lambda_2^2 p}{n^4\eta^2}=\frac{\lambda_2^2}{n^3}\,\frac{p}{n\eta^2}\to 0,\qquad
\frac{n^3 p}{n^4\eta^2}=\frac{p}{n\eta^2}\to 0,\qquad
\frac{\lambda_1^2 p}{n^4\eta^2}=\frac{\lambda_1^2}{n^3}\,\frac{p}{n\eta^2}\to 0,
\]
by $\lambda_1^2/n^3\to 0$ and $\lambda_2^2/n^3\to 0$ via Assumption 5, and
\[
\frac{p}{n\eta^2}=\frac{1}{n^{1-\alpha}\eta^2}\to 0,
\]
by Assumption 5(iv) and Assumption 3 with $n^{1-\nu}\eta^2\to\infty$ and $\alpha\le\nu$. Note that $\|\beta_0\|_2^2=O(p)$ by Assumption 1. Next we consider the second square-bracketed term on the right-hand side of (A.61). Its dominating term is of stochastic order
\[
O\Big(\frac{p}{n}\,\frac{1}{M^2}\Big)\to 0. \tag{A.62}
\]
The above is true since, by Assumption 5, $(\lambda_1^{*}/n^{3+\alpha})\,n^{\gamma(1-\alpha)}\to\infty$, $M=(\lambda_1^{*}/n^{3+\alpha})^{1/2\gamma}$, and since $\gamma>0$,
\[
\frac{M^2 n}{p}=\Big(\frac{\lambda_1^{*}}{n^{3+\alpha}}\Big)^{1/\gamma}n^{1-\alpha}\to\infty.
\]
The other terms in the second square-bracketed term of (A.61) are
\[
\frac{\lambda_2^2\|\beta_0\|_2^2}{n^4 M^2}=\frac{\lambda_2^2}{n^3}\,\frac{\|\beta_0\|_2^2}{nM^2}\to 0,
\]
by $\lambda_2^2/n^3\to 0$, $\|\beta_0\|_2^2=O(p)$, and $p/(nM^2)\to 0$ from the analysis of the dominating term above; in the same way,
\[
\frac{\lambda_1^2 p}{M^2 n^4}=\frac{\lambda_1^2}{n^3}\,\frac{p}{nM^2}\to 0,
\]
where we use $\lambda_1^2/n^3\to 0$ and again $p/(nM^2)\to 0$. Now we consider the last square-bracketed term in (A.61). After dividing by the squared denominator, the orders of its expressions are $M^{2\gamma}n^3/\lambda_1^{*2}$, $M^{2\gamma}\lambda_2^2 p/\lambda_1^{*2}$, $M^{2\gamma}n^3 p/\lambda_1^{*2}$, and $M^{2\gamma}p/\eta^{2\gamma}$, respectively. Clearly the order of the third expression dominates that of the first and second, given $\lambda_2^2/n^3\to 0$. The order of the third term is
\[
\frac{M^{2\gamma}n^3 p}{\lambda_1^{*2}}=\frac{1}{\lambda_1^{*}}\to 0,
\]
given $p=n^{\alpha}$, the definition of $M$, and Assumption 5. For the order of the last expression,
\[
\frac{M^{2\gamma}\lambda_1^{*2}p}{\lambda_1^{*2}\eta^{2\gamma}}=M^{2\gamma}\,\frac{p}{\eta^{2\gamma}}=\frac{\lambda_1^{*}}{n}\,\frac{1}{n^2\eta^{2\gamma}}\to 0,
\]
since $\lambda_1^{*}/n\to 0$, $p=n^{\alpha}$, the definition of $M$, and Assumption 5(iv).
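The display that opens the proof of Theorem 3(i) is, in effect, the Karush-Kuhn-Tucker condition that every coefficient estimated as zero must satisfy at the adaptive solution, and it can be checked numerically. The sketch below does this for the illustrative linear instrumental-variable moments used in the earlier sketches; the moment function, the weight matrix, and all variable names are assumptions for the example only, not the article's code.

```python
import numpy as np

def kkt_zero_check(beta_tilde, y, X, Z, W, lam1_star, w_hat, zero_idx):
    """Check the subgradient (Kuhn-Tucker) condition
        |2 * G_nj' W g_n(beta_tilde)| <= lam1_star * w_hat_j
    for each candidate zero coefficient j, using illustrative linear IV
    moments g_n(b) = Z'(y - X b), for which G_nj = -Z'X[:, j]."""
    g = Z.T @ (y - X @ beta_tilde)            # q-vector of summed moments
    ok = {}
    for j in zero_idx:
        G_j = -Z.T @ X[:, j]                  # jth column of the Jacobian
        ok[j] = abs(2.0 * G_j @ W @ g) <= lam1_star * w_hat[j]
    return ok

# Usage (beta_tilde, y, X, Z, and the weights w from the earlier sketches
# are assumed to exist; indices 2-5 are the coefficients believed to be zero):
# print(kkt_zero_check(beta_tilde, y, X, Z, np.eye(Z.shape[1]),
#                      lam1_star=2.0, w_hat=w, zero_idx=[2, 3, 4, 5]))
```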


Proof of Theorem 3(ii). The proof technique is very similar to the proof of theorem 3.3 of Zou and Zhang (2009), given Theorems 2 and 3(i). After the result of Theorem 3(i), it suffices to prove
\[
P\Big[\min_{j\in\mathcal{A}}|\tilde\beta_j|>0\Big]\to 1.
\]
Then we can write the following, with $\tilde\beta(\lambda_2,0)$ defined in (A.58):
\[
\min_{j\in\mathcal{A}}|\tilde\beta_j|>\min_{j\in\mathcal{A}}|\tilde\beta(\lambda_2,0)_j|-\|\tilde\beta-\tilde\beta(\lambda_2,0)\|_2. \tag{A.63}
\]
Also see that
\[
\min_{j\in\mathcal{A}}|\tilde\beta(\lambda_2,0)_j|>\min_{j\in\mathcal{A}}|\beta_{j0}|-\|\tilde\beta(\lambda_2,0)-\beta_{\mathcal{A},0}\|_2. \tag{A.64}
\]
Combine (A.63) and (A.64) to have
\[
\min_{j\in\mathcal{A}}|\tilde\beta_j|>\min_{j\in\mathcal{A}}|\beta_{j0}|-\|\tilde\beta(\lambda_2,0)-\beta_{\mathcal{A},0}\|_2-\|\tilde\beta-\tilde\beta(\lambda_2,0)\|_2. \tag{A.65}
\]
Now we consider the last two terms on the right-hand side of (A.65). Similar to the derivation of (A.43) and (A.44), we have
\[
E\|\tilde\beta(\lambda_2,0)-\beta_{\mathcal{A},0}\|_2^2
\le\frac{4\lambda_2^2\|\beta_{\mathcal{A},0}\|_2^2+4n^3 pB+o(n^2)}{[n^2 b+\lambda_2+o(n^2)]^2}=O(p/n)=o(1), \tag{A.66}
\]
by $\lambda_2^2\|\beta_{\mathcal{A},0}\|_2^2/n^4=(\lambda_2^2/n^2)(\|\beta_{\mathcal{A},0}\|_2^2/n^2)\to 0$ and $\|\beta_{\mathcal{A},0}\|_2^2=O(p)$. Then by (A.59),
\[
\|\tilde\beta-\tilde\beta(\lambda_2,0)\|_2\le\frac{\lambda_1^{*}\hat\eta^{-\gamma}\sqrt{p}}{n^2 b+\lambda_2+o_P(n^2)}. \tag{A.67}
\]
See that
\[
\frac{\lambda_1^{*}\hat\eta^{-\gamma}\sqrt{p}}{n^2 b+\lambda_2+o_P(n^2)}
=O\Big(\frac{\lambda_1^{*}\sqrt{p}}{n^2}\,\eta^{-\gamma}\Big)\Big(\frac{\hat\eta}{\eta}\Big)^{-\gamma}. \tag{A.68}
\]
For the first factor,
\[
\frac{\lambda_1^{*}\sqrt{p}}{n^2}\,\eta^{-\gamma}=\frac{1}{\sqrt{p}}\,\frac{\lambda_1^{*}p}{n^2\eta^{\gamma}}=\frac{1}{n^{\alpha/2}}\,o(1), \tag{A.69}
\]
by (A.62) ($p=n^{\alpha}$) and Assumptions 5(i), (iv). Then by Theorem 2,
\[
E\big[(\hat\eta/\eta)^2\big]\le 2+\frac{2}{\eta^2}E(\hat\eta-\eta)^2
\le 2+\frac{2}{\eta^2}E\|\hat\beta(\lambda_2,\lambda_1)-\beta_0\|_2^2
\le 2+\frac{8}{\eta^2}\,\frac{\lambda_2^2\|\beta_0\|_2^2+n^3 pB+\lambda_1^2 p+o(n^2)}{[n^2 b+\lambda_2+o(n^2)]^2}=O(1), \tag{A.70}
\]
by $\lambda_1/n\to 0$, $\lambda_2/n\to 0$, $p/n\to 0$, $\|\beta_0\|_2^2=O(p)$, and (A.62). Using the same technique as in the proof of Theorem 1(ii), substitute (A.69) and (A.70) into (A.68) to have
\[
\frac{\lambda_1^{*}\hat\eta^{-\gamma}\sqrt{p}}{n^2 b+\lambda_2+o_P(n^2)}\ \xrightarrow{P}\ 0. \tag{A.71}
\]
Now use (A.66) and (A.69)-(A.71), with $\eta=\min_{j\in\mathcal{A}}|\beta_{j0}|$, to have
\[
\min_{j\in\mathcal{A}}|\tilde\beta_j|>\eta-\sqrt{\frac{p}{n}}\,O_P(1)-\sqrt{\frac{1}{n^{\alpha}}}\,o_P(1). \tag{A.72}
\]
Then we obtain the desired result, since by Assumptions 5(v) and 5(iv) the last two terms on the right-hand side of (A.72) converge to zero faster than $\eta$.

Proof of Theorem 4. We now prove the limit result. First, define
\[
z_n=\delta'\,\frac{I+\lambda_2\big(\hat G(\hat\beta_{\mathrm{aenet},\mathcal{A}})'W_n\hat G(\hat\beta_{\mathrm{aenet},\mathcal{A}})\big)^{-1}}{1+\lambda_2/n}\,
\big(\hat G(\hat\beta_{\mathrm{aenet},\mathcal{A}})'W_n\hat G(\hat\beta_{\mathrm{aenet},\mathcal{A}})\big)^{1/2}\,n^{-1/2}\big(\hat\beta_{\mathrm{aenet},\mathcal{A}}-\beta_{\mathcal{A},0}\big).
\]
We need the following result. Following the proof of Theorem 1 and using Theorem 3,
\[
\|\hat\beta_{\mathrm{aenet},\mathcal{A}}-\beta_{\mathcal{A},0}\|_2^2\ \xrightarrow{P}\ 0,
\]
and by (A.66),
\[
\|\tilde\beta(\lambda_2,0)-\beta_{\mathcal{A},0}\|_2^2\ \xrightarrow{P}\ 0.
\]
Next, by Assumptions 1 and 2, and considering the above consistency results, we have
\[
\Big\|\Big(\frac{\hat G(\tilde\beta(\lambda_2,0))}{n}\Big)'W_n\Big(\frac{\hat G(\tilde\beta(\lambda_2,0))}{n}\Big)
-\Big(\frac{\hat G(\hat\beta_{\mathrm{aenet},\mathcal{A}})}{n}\Big)'W_n\Big(\frac{\hat G(\hat\beta_{\mathrm{aenet},\mathcal{A}})}{n}\Big)\Big\|_2^2\ \xrightarrow{P}\ 0. \tag{A.73}
\]
Next note that, as in the article by Zou and Zhang (2009, p. 18),
\[
\delta'\big[I+\lambda_2[\hat G(\tilde\beta(\lambda_2,0))'W_n\hat G(\tilde\beta(\lambda_2,0))]^{-1}\big]
\big[\hat G(\tilde\beta(\lambda_2,0))'W_n\hat G(\tilde\beta(\lambda_2,0))\big]^{1/2}n^{-1/2}\Big[\tilde\beta-\frac{\beta_{\mathcal{A},0}}{1+\lambda_2/n}\Big]
=\delta'\big(I+\lambda_2[\hat G(\tilde\beta(\lambda_2,0))'W_n\hat G(\tilde\beta(\lambda_2,0))]^{-1}\big)
\big(\hat G(\tilde\beta(\lambda_2,0))'W_n\hat G(\tilde\beta(\lambda_2,0))\big)^{1/2}n^{-1/2}\,\frac{\lambda_2\beta_{\mathcal{A},0}}{n+\lambda_2}
+\Big[\delta'\big(I+\lambda_2[\hat G(\tilde\beta(\lambda_2,0))'W_n\hat G(\tilde\beta(\lambda_2,0))]^{-1}\big)
\big(\hat G(\tilde\beta(\lambda_2,0))'W_n\hat G(\tilde\beta(\lambda_2,0))\big)^{1/2}n^{-1/2}\big(\tilde\beta-\tilde\beta(\lambda_2,0)\big)\Big]
+\Big[\delta'\big(I+\lambda_2[\hat G(\tilde\beta(\lambda_2,0))'W_n\hat G(\tilde\beta(\lambda_2,0))]^{-1}\big)
\big(\hat G(\tilde\beta(\lambda_2,0))'W_n\hat G(\tilde\beta(\lambda_2,0))\big)^{1/2}n^{-1/2}\big(\tilde\beta(\lambda_2,0)-\beta_{\mathcal{A},0}\big)\Big]. \tag{A.74}
\]
The last term on the right-hand side of (A.74) can be rewritten, by (A.23), as
\[
\big\{\lambda_2[\hat G(\tilde\beta(\lambda_2,0))'W_n\hat G(\tilde\beta(\lambda_2,0))]^{-1}+I\big\}
\big[\hat G(\tilde\beta(\lambda_2,0))'W_n\hat G(\tilde\beta(\lambda_2,0))\big]^{1/2}n^{-1/2}\big[\tilde\beta(\lambda_2,0)-\beta_{\mathcal{A},0}\big]
=-\lambda_2 n^{-1/2}\big[\hat G(\tilde\beta(\lambda_2,0))'W_n\hat G(\tilde\beta(\lambda_2,0))\big]^{-1/2}\beta_{\mathcal{A},0}
-n^{-1/2}\big[\hat G(\tilde\beta(\lambda_2,0))'W_n\hat G(\tilde\beta(\lambda_2,0))\big]^{-1/2}\big[\hat G(\tilde\beta(\lambda_2,0))'W_n g_n(\beta_{\mathcal{A},0})\big]+o_P(1), \tag{A.75}
\]
where $g_n(\beta_{\mathcal{A},0})=\sum_{i=1}^{n}g_i(\beta_{\mathcal{A},0})$. Note that in the above equation we have an asymptotically negligible term due to using (A.23), where we have $\bar\beta$ instead of $\tilde\beta(\lambda_2,0)$.




Via Theorem 1, and also using (A.73), with probability tending to 1, $z_n=T_1+T_2+T_3$, where
\[
T_1=\delta'\big(I+\lambda_2[\hat G(\tilde\beta(\lambda_2,0))'W_n\hat G(\tilde\beta(\lambda_2,0))]^{-1}\big)
\big[\hat G(\tilde\beta(\lambda_2,0))'W_n\hat G(\tilde\beta(\lambda_2,0))\big]^{1/2}n^{-1/2}\,\frac{\lambda_2\beta_{\mathcal{A},0}}{n+\lambda_2}
-\delta'\lambda_2 n^{-1/2}\big[\hat G(\tilde\beta(\lambda_2,0))'W_n\hat G(\tilde\beta(\lambda_2,0))\big]^{-1/2}\beta_{\mathcal{A},0},
\]
\[
T_2=\delta'\big[I+\lambda_2[\hat G(\tilde\beta(\lambda_2,0))'W_n\hat G(\tilde\beta(\lambda_2,0))]^{-1}\big]
\big[\hat G(\tilde\beta(\lambda_2,0))'W_n\hat G(\tilde\beta(\lambda_2,0))\big]^{1/2}n^{-1/2}\big(\tilde\beta-\tilde\beta(\lambda_2,0)\big),
\]
\[
T_3=\delta'\big[\hat G(\tilde\beta(\lambda_2,0))'W_n\hat G(\tilde\beta(\lambda_2,0))\big]^{-1/2}
\Big[\hat G(\tilde\beta(\lambda_2,0))'W_n\sum_{i=1}^{n}\frac{g_i(\beta_{\mathcal{A},0})}{n^{1/2}}\Big].
\]
Consider $T_1$; use Assumptions 1, 2, and 5 with $W_n=\hat\Omega_*^{-1}$ and $W=\Omega_*^{-1}$:
\[
T_1^2\le\frac{2}{n}\Big\|\big(I+\lambda_2(G(\beta_{\mathcal{A},0})'WG(\beta_{\mathcal{A},0}))^{-1}\big)\big(G(\beta_{\mathcal{A},0})'WG(\beta_{\mathcal{A},0})\big)^{1/2}\,\frac{\lambda_2\beta_{\mathcal{A},0}}{n+\lambda_2}\Big\|_2^2
+\frac{2}{n}\,\lambda_2^2\big\|\big(G(\beta_{\mathcal{A},0})'WG(\beta_{\mathcal{A},0})\big)^{-1/2}\beta_{\mathcal{A},0}\big\|_2^2+o_P(1)
\le\frac{2}{n}\Big(1+\frac{\lambda_2}{bn}\Big)^2\frac{\lambda_2^2 Bn}{(n+\lambda_2)^2}\,\|\beta_{\mathcal{A},0}\|_2^2
+\frac{2}{n}\,\lambda_2^2\|\beta_{\mathcal{A},0}\|_2^2\,\frac{1}{bn}+o_P(1)
=o_P(1),
\]
via $\lambda_2=o(n)$, $\|\beta_{\mathcal{A},0}\|_2^2=O(p)$, and $\lambda_2^2\|\beta_{\mathcal{A},0}\|_2^2/n^2\to 0$. Consider $T_2$; similar to the above analysis and (A.59),
\[
T_2^2\le\frac{2}{n}\Big(1+\frac{\lambda_2}{bn}\Big)^2(Bn)\,\|\tilde\beta-\tilde\beta(\lambda_2,0)\|_2^2
\le\frac{2}{n}\Big(1+\frac{\lambda_2}{bn}\Big)^2 Bn\Big[\frac{\lambda_1^{*}\hat\eta^{-\gamma}\sqrt{p}}{n^2 b+\lambda_2+o_P(n^2)}\Big]^2
=o_P(1),
\]
by (A.71) and $\lambda_2=o(n)$. For the term $T_3$ we benefit from the Liapunov central limit theorem. By Assumptions 1 and 2,
\[
T_3=\sum_{i=1}^{n}\delta'\big[\hat G(\tilde\beta(\lambda_2,0))'\hat\Omega_*^{-1}\hat G(\tilde\beta(\lambda_2,0))\big]^{-1/2}
\big[\hat G(\tilde\beta(\lambda_2,0))'\hat\Omega_*^{-1}\big]\frac{g_i(\beta_{\mathcal{A},0})}{n^{1/2}}+o_P(1)
=\sum_{i=1}^{n}\delta'\big[G(\beta_{\mathcal{A},0})'\Omega_*^{-1}G(\beta_{\mathcal{A},0})\big]^{-1/2}
\big[G(\beta_{\mathcal{A},0})'\Omega_*^{-1}\big]\frac{g_i(\beta_{\mathcal{A},0})}{n^{1/2}}+o_P(1).
\]
Next set $R_i=\delta'[G(\beta_{\mathcal{A},0})'\Omega_*^{-1}G(\beta_{\mathcal{A},0})]^{-1/2}[G(\beta_{\mathcal{A},0})'\Omega_*^{-1}]\,g_i(\beta_{\mathcal{A},0})/n^{1/2}$. So
\[
\sum_{i=1}^{n}ER_i^2
=n^{-1}\sum_{i=1}^{n}E\big[\delta'[G(\beta_{\mathcal{A},0})'\Omega_*^{-1}G(\beta_{\mathcal{A},0})]^{-1/2}G(\beta_{\mathcal{A},0})'\Omega_*^{-1}
g_i(\beta_{\mathcal{A},0})g_i(\beta_{\mathcal{A},0})'\Omega_*^{-1}G(\beta_{\mathcal{A},0})[G(\beta_{\mathcal{A},0})'\Omega_*^{-1}G(\beta_{\mathcal{A},0})]^{-1/2}\delta\big]
=\delta'[G(\beta_{\mathcal{A},0})'\Omega_*^{-1}G(\beta_{\mathcal{A},0})]^{-1/2}G(\beta_{\mathcal{A},0})'\Omega_*^{-1}
\Big[n^{-1}\sum_{i=1}^{n}Eg_i(\beta_{\mathcal{A},0})g_i(\beta_{\mathcal{A},0})'\Big]
\Omega_*^{-1}G(\beta_{\mathcal{A},0})[G(\beta_{\mathcal{A},0})'\Omega_*^{-1}G(\beta_{\mathcal{A},0})]^{-1/2}\delta.
\]
Then use $\big\|n^{-1}\sum_{i=1}^{n}Eg_i(\beta_{\mathcal{A},0})g_i(\beta_{\mathcal{A},0})'-\Omega_*\big\|_2^2\to 0$ and take the limit of the term above:
\[
\lim_{n\to\infty}\sum_{i=1}^{n}ER_i^2
=\delta'[G(\beta_{\mathcal{A},0})'\Omega_*^{-1}G(\beta_{\mathcal{A},0})]^{-1/2}G(\beta_{\mathcal{A},0})'\Omega_*^{-1}\Omega_*\Omega_*^{-1}G(\beta_{\mathcal{A},0})[G(\beta_{\mathcal{A},0})'\Omega_*^{-1}G(\beta_{\mathcal{A},0})]^{-1/2}\delta
=\delta'\delta=1.
\]
Next we verify the Liapunov condition, $\sum_{i=1}^{n}E|R_i|^{2+l}\to 0$ for some $l>0$. See that
\[
\sum_{i=1}^{n}E|R_i|^{2+l}
=\frac{1}{n^{1+l/2}}\sum_{i=1}^{n}E\big|\delta'(G(\beta_{\mathcal{A},0})'\Omega_*^{-1}G(\beta_{\mathcal{A},0}))^{-1/2}G(\beta_{\mathcal{A},0})'\Omega_*^{-1}g_i(\beta_{\mathcal{A},0})\big|^{2+l}
\le\Big(\frac{B}{b}\Big)^{1+l/2}\frac{1}{n^{1+l/2}}\sum_{i=1}^{n}E\big\|\Omega_*^{-1/2}g_i(\beta_{\mathcal{A},0})\big\|_2^{2+l}\to 0.
\]
So $T_3\xrightarrow{d}N(0,1)$. The desired result then follows from $z_n=T_1+T_2+T_3$ with probability approaching 1.

ACKNOWLEDGMENTS

Zhang's research is supported by National Science Foundation Grants DMS-0654293, 1347844, and 1309507, and by National Institutes of Health Grant NIH/NCI R01 CA-08548. We thank the co-editor Jonathan Wright, an associate editor, and two anonymous referees for comments that have substantially improved the article. Mehmet Caner also thanks Anders Bredahl Kock for advice on the consistency proof.

[Received March 2012. Revised May 2013.]


REFERENCES

Abadir, K. M., and Magnus, J. R. (2005), Matrix Algebra, New York: Cambridge University Press.
Ai, C., and Chen, X. (2003), "Efficient Estimation of Models With Conditional Moment Restrictions Containing Unknown Functions," Econometrica, 71, 1795–1843.
Alfaro, L., Kalemli-Ozcan, S., and Volosovych, V. (2008), "Why Doesn't Capital Flow From Rich to Poor Countries? An Empirical Investigation," Review of Economics and Statistics, 90, 347–368.
Andrews, D. W. K., and Lu, B. (2001), "Consistent Model and Moment Selection Procedures for GMM Estimation With Application to Dynamic Panel Data Models," Journal of Econometrics, 101, 123–165.
Caner, M. (2009), "Lasso-Type GMM Estimator," Econometric Theory, 25, 270–290.
Chen, X. (2007), "Large Sample Sieve Estimation of Semi-Nonparametric Models," in Handbook of Econometrics (Vol. 6B), eds. J. J. Heckman and E. E. Leamer, Oxford: North-Holland, chap. 76, pp. 5550–5623.
Chen, X., and Ludvigson, S. C. (2009), "Land of Addicts? An Empirical Investigation of Habit-Based Asset Pricing Models," Journal of Applied Econometrics, 24, 1057–1093.
Davidson, J. (1994), Stochastic Limit Theory, New York: Oxford University Press.
Fan, J., and Li, R. (2001), "Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties," Journal of the American Statistical Association, 96, 1348–1360.
Fan, J., and Peng, H. (2004), "Nonconcave Penalized Likelihood With a Diverging Number of Parameters," The Annals of Statistics, 32, 928–961.
Gao, X., and Huang, J. (2010), "Asymptotic Analysis of High-Dimensional LAD Regression With Lasso," Statistica Sinica, 20, 1485–1506.
Guggenberger, P., and Smith, R. J. (2008), "Generalized Empirical Likelihood Tests in Time Series Models With Potential Identification Failure," Journal of Econometrics, 142, 134–161.
Han, C., and Phillips, P. C. B. (2006), "GMM With Many Moment Conditions," Econometrica, 74, 147–192.
Hansen, L. P. (1982), "Large Sample Properties of Generalized Method of Moments Estimators," Econometrica, 50, 1029–1054.
He, X., and Shao, Q. M. (2000), "On Parameters of Increasing Dimensions," Journal of Multivariate Analysis, 75, 120–135.
Huang, J., Horowitz, J., and Ma, S. (2008), "Asymptotic Properties of Bridge Estimators in Sparse High-Dimensional Regression Models," The Annals of Statistics, 36, 587–613.
Huber, P. (1973), "Robust Regression: Asymptotics, Conjectures and Monte Carlo," The Annals of Statistics, 1, 799–821.
Knight, K. (2008), "Shrinkage Estimation for Nearly-Singular Designs," Econometric Theory, 24, 323–338.
Knight, K., and Fu, W. (2000), "Asymptotics for Lasso-Type Estimators," The Annals of Statistics, 28, 1356–1378.
Lam, C., and Fan, J. (2008), "Profile-Kernel Likelihood Inference With Diverging Number of Parameters," The Annals of Statistics, 36, 2232–2260.
Leeb, H., and Pötscher, B. (2005), "Model Selection and Inference: Facts and Fiction," Econometric Theory, 21, 21–59.
Liao, Z. (2011), "Adaptive GMM Shrinkage Estimation With Consistent Moment Selection," working paper, New Haven, CT: Department of Economics, Yale University.
Newey, W., and Powell, J. (2003), "Instrumental Variable Estimation of Nonparametric Models," Econometrica, 71, 1557–1569.
Newey, W., and Windmeijer, F. (2009a), "GMM With Many Weak Moment Conditions," Econometrica, 77, 687–719.
Newey, W., and Windmeijer, F. (2009b), "Supplement to 'GMM With Many Weak Moment Conditions'," Econometrica, 77, 687–719.
Otsu, T. (2006), "Generalized Empirical Likelihood Inference for Nonlinear and Time Series Models Under Weak Identification," Econometric Theory, 22, 513–527.
Portnoy, S. (1984), "Asymptotic Behavior of M-Estimators of p Regression Parameters When p²/n is Large. I. Consistency," The Annals of Statistics, 12, 1298–1309.
Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267–288.
Wang, H., Li, B., and Leng, C. (2009), "Shrinkage Tuning Parameter Selection With a Diverging Number of Parameters," Journal of the Royal Statistical Society, Series B, 71, 671–683.
Wang, H., Li, R., and Tsai, C. (2007), "Tuning Parameter Selectors for the Smoothly Clipped Absolute Deviation Method," Biometrika, 94, 553–568.
Zhang, H., and Lu, W. (2007), "Adaptive Lasso for Cox's Proportional Hazards Model," Biometrika, 37, 1–13.
Zou, H. (2006), "The Adaptive Lasso and Its Oracle Properties," Journal of the American Statistical Association, 101, 1418–1429.
Zou, H., Hastie, T., and Tibshirani, R. (2007), "On the Degrees of Freedom of the Lasso," The Annals of Statistics, 35, 2173–2192.
Zou, H., and Zhang, H. (2009), "On the Adaptive Elastic-Net With a Diverging Number of Parameters," The Annals of Statistics, 37, 1733–1751.
