4. Artificial neural networks

4.1. General considerations

In the previous section, we introduced artificial neural networks (ANNs) as an example of an approximation dictionary supporting highly nonlinear approximation. In this section, we consider ANNs in greater detail. Our attention is motivated not only by their flexibility and the fact that many powerful approximation methods can be viewed as special cases of ANNs (e.g., Fourier series, wavelets, and ridgelets), but also by two further reasons. First, ANNs have become increasingly popular in economic applications. Second, despite their increasing popularity, the application of ANNs in economics and other fields has often run into serious stumbling blocks, precisely reflecting the three key challenges to the use of nonlinear methods articulated at the outset. In this section we explore some further properties of ANNs that may help in mitigating or eliminating some of these obstacles, permitting both their more successful practical application and a more informed assessment of their relative usefulness.

Artificial neural networks comprise a family of flexible functional forms posited by cognitive scientists attempting to understand the behavior of biological neural systems. Kuan and White (1994) provide a discussion of their origins and an econometric perspective. Our focus here is on the ANNs introduced above, that is, the class of "single hidden layer feedforward networks", which have the functional form

\[
f(x,\theta) = x'\alpha + \sum_{j=1}^{q} \psi(x'\gamma_j)\,\beta_j, \tag{6}
\]

where ψ is a given activation function, and θ ≡ (α', β', γ')', β ≡ (β_1, …, β_q)', γ ≡ (γ_1', …, γ_q')'. The term ψ(x'γ_j) is called the "activation" of "hidden unit" j.

Except for the case of ridgelets, ANNs generally take the γ_j's to be free parameters, resulting in a parameterization nonlinear in the parameters, with all the attendant computational challenges that we would like to avoid. Indeed, these difficulties have been formalized by Jones (1997) and Vu (1998), who prove that optimizing such an ANN is an NP-hard problem. It turns out, however, that by suitably choosing the activation function ψ, it is possible to retain the flexibility of ANNs without requiring the γ_j's to be free parameters and without necessarily imposing the ridgelet activation function or schedule of γ_j values, which can be somewhat cumbersome to implement in higher dimensions. This possibility is a consequence of results of Stinchcombe and White (1998) ("SW"), as foreshadowed in earlier results of Bierens (1990). Taking advantage of these results leads to parametric models that are nonlinear in the predictors, with the attendant advantages of flexibility, and linear in the parameters, with the attendant advantages of computational convenience. These computational advantages create the possibility of mitigating the difficulties formalized by Jones (1997) and Vu (1998). We first take up the results of SW that create these opportunities and then describe a method for exploiting them for forecasting purposes. Subsequently, we perform some numerical experiments that shed light on the extent to which the resulting methods may succeed in avoiding the documented difficulties of nonlinearly parameterized ANNs.
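To fix ideas, here is a minimal sketch (ours, not the chapter's; Python with NumPy, with a logistic activation chosen purely for illustration) of how a forecast is computed from the functional form (6):

    import numpy as np

    def logistic(z):
        """Logistic cdf activation, psi(z) = 1 / (1 + exp(-z))."""
        return 1.0 / (1.0 + np.exp(-z))

    def ann_forecast(X, alpha, betas, gammas, psi=logistic):
        """Single hidden layer feedforward network, Eq. (6):
        f(x, theta) = x'alpha + sum_j psi(x'gamma_j) * beta_j.

        X      : (n, k) matrix of predictors (include a constant column if desired)
        alpha  : (k,) linear coefficients
        betas  : (q,) hidden-to-output coefficients
        gammas : (q, k) input-to-hidden coefficients, one row per hidden unit
        """
        linear_part = X @ alpha                   # x'alpha
        activations = psi(X @ gammas.T)           # (n, q) matrix of psi(x'gamma_j)
        return linear_part + activations @ betas  # add sum_j psi(x'gamma_j) beta_j

Each row of gammas plays the role of one γ_j above.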
4.2. Generically comprehensively revealing activation functions

In work proposing new specification tests with the property of consistency (that is, the property of having power against model misspecification of any form), Bierens (1990) proved a powerful and remarkable result. This result states essentially that for any random variable ε_t and random vector X_t, under general conditions E(ε_t | X_t) ≠ 0 with nonzero probability implies E(exp(X_t'γ) ε_t) ≠ 0 for almost every γ ∈ Γ, where Γ is any nonempty compact set. Applying this result to the present context with ε_t = Y_t − f(X_t, θ*), Bierens's result implies that if (with nonzero probability)

\[
E\bigl(Y_t - f(X_t,\theta^*) \mid X_t\bigr) = \mu(X_t) - f(X_t,\theta^*) \neq 0,
\]

then for almost every γ ∈ Γ we have

\[
E\bigl(\exp(X_t'\gamma)\bigl(Y_t - f(X_t,\theta^*)\bigr)\bigr) \neq 0.
\]

That is, if the model N is misspecified, then the prediction error ε_t = Y_t − f(X_t, θ*) resulting from the use of model N is correlated with exp(X_t'γ) for essentially any choice of γ. Bierens exploits this fact to construct a specification test based on a choice for γ that maximizes the sample correlation between exp(X_t'γ) and the sample prediction error ε̂_t = Y_t − f(X_t, θ̂).

Stinchcombe and White (1998) show that Bierens's (1990) result holds more generally, with the exponential function replaced by any ψ belonging to the class of generically comprehensively revealing (GCR) functions. These functions are "comprehensively revealing" in the sense that they can reveal arbitrary model misspecifications (μ(X_t) − f(X_t, θ*) ≠ 0 with nonzero probability); they are generic in the sense that almost any choice for γ will reveal the misspecification.

An important class of functions that SW demonstrate to be GCR is the class of nonpolynomial real analytic functions (functions that are everywhere locally given by a convergent power series), such as the logistic cumulative distribution function (cdf) or the hyperbolic tangent function, tanh. Among other things, SW show how the GCR functions can be used to test for misspecification in ways that parallel Bierens's procedures for the regression context, but that also extend to specification testing beyond the regression context, such as testing for equality of distributions.

Here, we exploit SW's results for a different purpose, namely to obtain flexible parameterizations nonlinear in the predictors and linear in the parameters. To proceed, we represent a q hidden unit ANN more explicitly as

\[
f_q(x,\theta_q^*) = x'\alpha_q^* + \sum_{j=1}^{q} \psi(x'\gamma_j^*)\,\beta_{qj}^*,
\]

where ψ is GCR, and we let ε_t = Y_t − f_q(X_t, θ*_q). If, with nonzero probability, μ(X_t) − f_q(X_t, θ*_q) ≠ 0, then for almost every γ ∈ Γ we have E(ψ(X_t'γ) ε_t) ≠ 0. As Γ is compact, we can pick γ*_{q+1} such that

\[
\bigl|\mathrm{corr}\bigl(\psi(X_t'\gamma_{q+1}^*),\,\varepsilon_t\bigr)\bigr| \;\geq\; \bigl|\mathrm{corr}\bigl(\psi(X_t'\gamma),\,\varepsilon_t\bigr)\bigr| \quad \text{for all } \gamma \in \Gamma,
\]

where corr(·, ·) denotes the correlation of the indicated variables.

Let Γ_m be a finite subset of Γ having m elements whose neighborhoods cover Γ. With ψ chosen to be continuous, the continuity of the correlation operator then ensures that, with m sufficiently large, one can achieve correlations nearly as great as those attainable by optimizing over Γ by instead optimizing over Γ_m. Thus one can avoid full optimization over Γ at potentially small cost by instead picking γ*_{q+1} ∈ Γ_m such that

\[
\bigl|\mathrm{corr}\bigl(\psi(X_t'\gamma_{q+1}^*),\,\varepsilon_t\bigr)\bigr| \;\geq\; \bigl|\mathrm{corr}\bigl(\psi(X_t'\gamma),\,\varepsilon_t\bigr)\bigr| \quad \text{for all } \gamma \in \Gamma_m.
\]

This suggests a process of adding hidden units in a stepwise manner, stopping when |corr(ψ(X_t'γ*_{q+1}), ε_t)| (or some other suitable measure of the predictive value of the marginal hidden unit) is sufficiently small.
5. QuickNet

We now propose a family of algorithms based on these considerations that can work well in practice, called "QuickNet". The algorithm requires specifying a priori a maximum number of hidden units, say q̄, a GCR activation function ψ, an integer m specifying the cardinality of Γ_m, and a method for choosing the subsets Γ_m. In practice, initially choosing q̄ to be on the order of 10 or 20 seems to work well; if the results indicate that there is additional predictability not captured using q̄ hidden units, this limit can always be relaxed. (For concreteness and simplicity, suppose for now that q̄ < ∞. More generally, one may take q̄ = q̄_n, with q̄_n → ∞ as n → ∞.) A common choice for ψ is the logistic cdf, ψ(z) = 1/(1 + exp(−z)). Ridgelet activation functions are also an appealing option.

Choosing m to be 500–1000 often works well, with Γ_m consisting of a range of values of γ (chosen either deterministically or, especially with more than a few predictors, randomly) such that the norm of γ is neither too small nor too large. As we discuss in greater detail below, when the norm of γ is too small, ψ(X_t'γ) is approximately linear in X_t, whereas when the norm of γ is too large, ψ(X_t'γ) can become approximately constant in X_t; both situations are to be avoided. This is true not only for the logistic cdf but also for many other nonlinear choices for ψ. In any given instance, one can experiment with these choices to observe the sensitivity or robustness of the method to these choices.

Our approach also requires a method for selecting the appropriate degree of model complexity, so as to avoid overfitting, the second of the key challenges to the use of nonlinear models identified above. For concreteness, we first specify a prototypical member of the QuickNet family using cross-validated mean squared error (CVMSE) for this purpose. Below, we also briefly discuss possibilities other than CVMSE.

5.1. A prototype QuickNet algorithm

We now specify a prototype QuickNet algorithm. The specification of this section is generic, in that for succinctness we do not provide details on the construction of Γ_m or the computation of CVMSE. We provide further specifics on these aspects of the algorithm in Sections 5.2 and 5.3. Our prototypical QuickNet algorithm is a form of relaxed greedy algorithm consisting of the following steps:

Step 0: Compute α̂_0 and ε̂_{0t} (t = 1, …, n) by OLS:

\[
\hat{\alpha}_0 = (X'X)^{-1}X'Y, \qquad \hat{\varepsilon}_{0t} = Y_t - X_t'\hat{\alpha}_0.
\]

Compute CVMSE(0) (cross-validated mean squared error for Step 0; details are provided below), and set q = 1.

Step 1a: Pick Γ_m, and find γ̂_q such that

\[
\hat{\gamma}_q = \arg\max_{\gamma \in \Gamma_m}\ \bigl[\hat{r}\bigl(\psi(X_t'\gamma),\,\hat{\varepsilon}_{q-1,t}\bigr)\bigr]^2,
\]

where r̂ denotes the sample correlation between the indicated random variables. To perform this maximization, one simply regresses ε̂_{q−1,t} on a constant and ψ(X_t'γ) for each γ ∈ Γ_m, and picks as γ̂_q the γ that yields the largest R².

Step 1b: Compute α̂_q, β̂_q ≡ (β̂_{q1}, …, β̂_{qq})' by OLS, regressing Y_t on X_t and ψ(X_t'γ̂_j), j = 1, …, q, and compute ε̂_{qt} (t = 1, …, n) as

\[
\hat{\varepsilon}_{qt} = Y_t - X_t'\hat{\alpha}_q - \sum_{j=1}^{q} \psi(X_t'\hat{\gamma}_j)\,\hat{\beta}_{qj}.
\]

Compute CVMSE(q) and set q = q + 1. If q > q̄, stop. Otherwise, return to Step 1a.

Step 2: Pick q̂ such that

\[
\hat{q} = \arg\min_{q \in \{1,\ldots,\bar{q}\}} \mathrm{CVMSE}(q),
\]

and set the estimated parameters to be those associated with q̂: θ̂_q̂ ≡ (α̂_q̂', β̂_q̂', γ̂_1', …, γ̂_q̂')'.

Step 3 (Optional): Perform nonlinear least squares for Y_t using the functional form

\[
f_{\hat{q}}(x,\theta_{\hat{q}}) = x'\alpha + \sum_{j=1}^{\hat{q}} \psi(x'\gamma_j)\,\beta_j,
\]

starting the nonlinear iterations at θ̂_q̂.
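To make Steps 0–2 concrete, the following is a minimal sketch in Python/NumPy, under simplifying assumptions of our own: the regressor matrix X includes a constant column, a single fixed candidate array Gamma_m is reused at every pass of Step 1a (its construction is taken up in Section 5.2), and CVMSE is the delete-1 cross-validated MSE computed via the standard leave-one-out identity for OLS, appropriate for cross-section data (block alternatives for time series are discussed in Section 5.3). The names logistic, ols_fit, and quicknet are ours.

    import numpy as np

    def logistic(z):
        """Logistic cdf activation."""
        return 1.0 / (1.0 + np.exp(-z))

    def ols_fit(Z, y):
        """OLS coefficients, residuals, and delete-1 CVMSE via the standard
        leave-one-out identity e_t / (1 - h_tt) for linear regression."""
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ coef
        H = Z @ np.linalg.pinv(Z.T @ Z) @ Z.T      # hat matrix (fine for moderate n)
        loo = resid / (1.0 - np.diag(H))
        return coef, resid, float(np.mean(loo ** 2))

    def quicknet(X, y, Gamma_m, q_bar, psi=logistic):
        """Prototype QuickNet, Steps 0-2.  X: (n, k) regressors including a
        constant column; Gamma_m: (m, k) array of candidate gamma vectors."""
        # Step 0: linear model by OLS.
        Z = X.copy()
        _, resid, cv = ols_fit(Z, y)
        cvmse, gammas = [cv], []
        acts_all = psi(X @ Gamma_m.T)              # (n, m) candidate hidden-unit activations
        for q in range(1, q_bar + 1):
            # Step 1a: candidate activation most correlated (largest R^2) with current residuals.
            r = np.array([np.corrcoef(acts_all[:, j], resid)[0, 1]
                          for j in range(acts_all.shape[1])])
            best = int(np.nanargmax(r ** 2))
            gammas.append(Gamma_m[best])
            # Step 1b: refit by OLS with the new hidden unit added; recompute CVMSE(q).
            Z = np.column_stack([Z, acts_all[:, best]])
            _, resid, cv = ols_fit(Z, y)
            cvmse.append(cv)
        # Step 2: choose q_hat in {1, ..., q_bar} minimizing CVMSE.
        q_hat = 1 + int(np.argmin(cvmse[1:]))
        Z_hat = np.column_stack([X] + [psi(X @ g) for g in gammas[:q_hat]])
        coef, _, _ = ols_fit(Z_hat, y)
        return q_hat, coef, gammas[:q_hat], cvmse

Redrawing Γ_m at each pass of Step 1a amounts simply to moving the computation of acts_all inside the loop.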
For convenience in what follows, we let θ̂ denote the parameter estimates obtained via this QuickNet algorithm (or via any other member of the family, discussed below).

QuickNet's most obvious virtue is its computational simplicity. Steps 0–2 involve only OLS regression; this is essentially a consequence of exploiting the linearity of f_q in α and β. Although a potentially large number (m) of regressions is involved in Step 1a, these regressions involve only a single regressor plus a constant. They can be computed so quickly that this is not a significant concern. Moreover, the user has full control (through the specification of m) over how intense a search is performed in Step 1a. The only computational headache posed by using OLS in Steps 0–2 results from multicollinearity, but this can easily be avoided by taking proper care to select predictors X_t at the outset that vary sufficiently independently (little, if any, predictive power is lost in so doing), and by avoiding (either ex ante or ex post) any choice of γ in Step 1a that results in too little sample variation in ψ(X_t'γ). (See Section 5.2 below for more on this issue.) Consequently, execution of Steps 0–2 of QuickNet can be fast, justifying our name for the algorithm.

Above, we referred to QuickNet as a form of relaxed greedy algorithm. QuickNet is a greedy algorithm because in Step 1a it searches for a single best additional term. The usual greedy algorithms add one term at a time, but specify full optimization over γ. In contrast, by restricting attention to Γ_m, QuickNet greatly simplifies computation, and by using a GCR activation function ψ, QuickNet ensures that the risk of missing predictively useful nonlinearities is small. QuickNet is a relaxed greedy algorithm because it permits full adjustment of the estimated coefficients of all the previously included terms, permitting it to take full predictive advantage of these terms as the algorithm proceeds. In contrast, typical relaxed greedy algorithms permit only modest adjustment in the relative contributions of the existing and added terms.

The optional Step 3 involves an optimization nonlinear in the parameters, so here one may seem to lose the computational simplicity motivating our algorithm design. In fact, however, Steps 0–2 set the stage for a relatively simple computational exercise in Step 3. A main problem in the brute-force nonlinear optimization of ANN models is, for given q, finding a good (near global optimum) value for θ, as the objective function is typically nonconvex in nasty ways. Further, the larger is q, the more difficult this becomes and the easier it is to get stuck at relatively poor local optima. Typically, the optimization bogs down fairly early on (with the best fits seen for relatively small values of q), preventing the model from taking advantage of its true flexibility. (Our example in Section 7 illustrates these issues.) In contrast, the θ̂ produced by Steps 0–2 of QuickNet typically delivers much better fit than estimates produced by brute-force nonlinear optimization, so that local optimization in the neighborhood of θ̂ produces a potentially useful refinement of θ̂.
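As one illustration of how Step 3 might be carried out (a sketch only: it uses a generic nonlinear least-squares solver from SciPy rather than the sequence-of-OLS iterations described below, and the names pack, unpack, and refine, as well as the assumption that the Step 0–2 estimates are available as NumPy arrays alpha_hat (k,), beta_hat (q̂,), gamma_hat (q̂, k), are ours):

    import numpy as np
    from scipy.optimize import least_squares

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    def pack(alpha, betas, gammas):
        """Flatten (alpha, beta, gamma) into a single parameter vector theta."""
        return np.concatenate([alpha, betas, gammas.ravel()])

    def unpack(theta, k, q):
        alpha = theta[:k]
        betas = theta[k:k + q]
        gammas = theta[k + q:].reshape(q, k)
        return alpha, betas, gammas

    def residuals(theta, X, y, k, q):
        """Prediction errors Y_t - f_qhat(X_t, theta) for the fitted network."""
        alpha, betas, gammas = unpack(theta, k, q)
        fit = X @ alpha + logistic(X @ gammas.T) @ betas
        return y - fit

    def refine(X, y, alpha_hat, beta_hat, gamma_hat):
        """Step 3: nonlinear least squares started at the Step 0-2 estimates."""
        k, q = X.shape[1], len(beta_hat)
        theta0 = pack(alpha_hat, beta_hat, gamma_hat)
        out = least_squares(residuals, theta0, args=(X, y, k, q))
        return unpack(out.x, k, q)

Whether the refined estimates are retained can then be decided by recomputing the CVMSE, as discussed next.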
Moreover, the required Step 3 computations are particularly simple, as the optimization is performed with a fixed number q̂ of hidden units, and the iterations of the nonlinear optimization can be computed as a sequence of OLS regressions. Whether or not the refinements of Step 3 are helpful can be assessed using the CVMSE: if CVMSE improves after Step 3, one can use the refined estimate; otherwise one can use the unrefined (Step 2) estimate.

5.2. Constructing Γ_m

The proper choice of Γ_m in Step 1a can make a significant difference in QuickNet's performance. The primary consideration in choosing Γ_m is to avoid choices that will result in candidate hidden unit activations that are collinear with previously included predictors, as such candidate hidden units will tend to be uncorrelated with the prediction errors ε̂_{q−1,t} and therefore have little marginal predictive power. As previously included predictors will typically include the original X_t's, particular care should be taken to avoid choosing Γ_m so that it contains elements γ for which ψ(X_t'γ) is either approximately constant or approximately proportional to X_t'γ.

To see what this entails in a simple setting, consider the case of a logistic cdf activation function ψ and a single predictor, X_t, having mean zero. We denote a candidate nonlinear predictor as ψ(γ_1 X_t + γ_0). If γ_0 is chosen to be large in absolute value relative to γ_1 X_t, then ψ(γ_1 X_t + γ_0) behaves approximately as ψ(γ_0), that is, it is roughly constant. To avoid this, γ_0 can be chosen to be roughly of the same order of magnitude as sd(γ_1 X_t), the standard deviation of γ_1 X_t. On the other hand, suppose γ_1 is chosen to be small relative to sd(X_t). Then ψ(γ_1 X_t + γ_0) varies approximately proportionately to γ_1 X_t + γ_0. To avoid this, γ_1 should be chosen to be at least of the order of magnitude of sd(X_t).

A simple way to ensure these properties is to pick γ_0 and γ_1 randomly, independently of each other and of X_t. We can pick γ_1 to be positive, with a range spanning modest multiples of sd(X_t), and pick γ_0 to have mean zero, with a variance that is roughly comparable to that of γ_1 X_t. The exclusion of nonpositive values for γ_1 is of no consequence here, given that ψ is monotone. Randomly drawing m such choices for (γ_0, γ_1) thus delivers a set Γ_m that will be unlikely to contain elements that are either approximately constant or collinear with the included predictors. With these precautions, the elements of Γ_m are nonlinear functions of X_t and, as can be shown, are generically not linearly dependent on other functions of X_t, such as previously included linear or nonlinear predictors. Choosing Γ_m in this way thus generates a plausibly useful collection of candidate nonlinear predictors.

In the multivariate case, similar considerations operate. Here, however, we replace γ_1 X_t with γ_1 (X_t'γ_2), where γ_2 is a direction vector, that is, a vector on S^{k−2}, the unit sphere in R^{k−1}, as in Candes's ridgelet parameterization. Now the magnitude of γ_0 should be comparable to sd(γ_1 (X_t'γ_2)), and the magnitude of γ_1 should be chosen to be at least of the order of magnitude of sd(X_t'γ_2). One can proceed by picking a direction γ_2 on the unit sphere (e.g., γ_2 = Z/(Z'Z)^{1/2} is distributed uniformly on the unit sphere, provided Z is (k − 1)-variate unit normal), then choosing γ_1 to be positive, with a range spanning modest multiples of sd(X_t'γ_2), and picking γ_0 to have mean zero, with a variance roughly comparable to that of γ_1 (X_t'γ_2). Drawing m such choices for (γ_0, γ_1, γ_2) thus delivers a set Γ_m that will be unlikely to contain elements that are either approximately constant or collinear with the included predictors, just as in the univariate case.
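A minimal sketch of this recipe (ours; it assumes the non-constant predictors in X have been standardized to roughly unit scale, and the function name draw_gamma_candidates and the default range c_range are illustrative choices):

    import numpy as np

    def draw_gamma_candidates(X, m, c_range=(0.5, 3.0), rng=None):
        """Draw m candidate hidden-unit parameter vectors as in Section 5.2.

        X is the (n, d) matrix of non-constant predictors, assumed roughly
        standardized.  For each candidate: gamma_2 is a direction drawn uniformly
        on the unit sphere, gamma_1 a positive scale spanning modest multiples of
        sd(X'gamma_2), and gamma_0 an intercept with mean zero and standard
        deviation comparable to that of gamma_1 * X'gamma_2.  Rows of the output
        are (gamma_0, gamma_1 * gamma_2'), so the hidden-unit index for
        observation t is gamma_0 + X_t'(gamma_1 * gamma_2).
        """
        rng = np.random.default_rng(rng)
        n, d = X.shape
        out = np.empty((m, d + 1))
        for i in range(m):
            z = rng.standard_normal(d)
            gamma2 = z / np.sqrt(z @ z)                       # uniform direction on the unit sphere
            index_sd = max(float((X @ gamma2).std()), 1e-12)  # sd(X'gamma_2)
            gamma1 = rng.uniform(*c_range) * index_sd         # positive, modest multiple of sd(X'gamma_2)
            gamma0 = rng.normal(0.0, gamma1 * index_sd)       # mean zero, sd comparable to gamma_1 * sd(X'gamma_2)
            out[i, 0] = gamma0
            out[i, 1:] = gamma1 * gamma2
        return out

If the regressor matrix used in the Section 5.1 sketch has a constant as its first column, the rows returned here can be passed directly as Gamma_m.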
These considerations are not specific to the logistic cdf activation ψ, but operate generally. The key is to avoid choosing a Γ_m that contains elements that are either approximately constant or proportional to the included predictors. The strategies just described are broadly useful for this purpose and can be fine-tuned for any particular choice of activation function.

5.3. Controlling overfit

The advantageous flexibility of nonlinear modeling is also responsible for the second key challenge noted above to the use of nonlinear forecasting models, namely the danger of overfitting the data. Our prototype QuickNet uses cross-validation to choose the metaparameter q indexing model complexity, thereby attempting to control the tendency of such flexible models to overfit the sample data. This is a common method, with a long history in statistical and econometric applications.

Numerous other members of the QuickNet family can be constructed by replacing CVMSE with alternate measures of model fit, such as AIC [Akaike (1970, 1973)], C_p [Mallows (1973)], BIC [Schwarz (1978), Hannan and Quinn (1979)], Minimum Description Length (MDL) [Rissanen (1978)], Generalized Cross-Validation (GCV) [Craven and Wahba (1979)], and others. We have specified CVMSE for concreteness and simplicity in our prototype, but, as results of Shao (1993, 1997) establish, the family members formed by using alternate model selection criteria in place of CVMSE have equivalent asymptotic properties under specific conditions, as discussed further below.

The simplest form of cross-validation is "delete 1" cross-validation [Allen (1974), Stone (1974, 1976)], which computes CVMSE as

\[
\mathrm{CVMSE}^{(1)}(q) = \frac{1}{n}\sum_{t=1}^{n} \hat{\varepsilon}^2_{qt(-t)},
\]

where ε̂_{qt(−t)} is the prediction error for observation t computed using estimators α̂_{0(−t)} and β̂_{qj(−t)}, j = 1, …, q, obtained by omitting observation t from the sample, that is,

\[
\hat{\varepsilon}_{qt(-t)} = Y_t - X_t'\hat{\alpha}_{0(-t)} - \sum_{j=1}^{q} \psi(X_t'\hat{\gamma}_j)\,\hat{\beta}_{qj(-t)}.
\]

Alternatively, one can calculate the "delete d" cross-validated mean squared error, CVMSE^{(d)} [Geisser (1975)]. For this, let S be a collection of N subsets s of {1, …, n} containing d elements. Let ε̂_{qt(−s)} be the prediction error for observation t computed using estimators α̂_{0(−s)} and β̂_{qj(−s)}, j = 1, …, q, obtained by omitting the observations in the set s from the estimation sample. Then CVMSE^{(d)} is computed as

\[
\mathrm{CVMSE}^{(d)}(q) = \frac{1}{dN}\sum_{s \in S}\sum_{t \in s} \hat{\varepsilon}^2_{qt(-s)}.
\]
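For a model that, like QuickNet with the γ̂_j's held fixed, is linear in the parameters, delete-d cross-validation requires only repeated OLS fits. A minimal sketch (ours; Z is assumed to collect X_t and the q hidden-unit activations, and the N subsets are drawn at random, as is appropriate for cross-section data):

    import numpy as np

    def cvmse_delete_d(Z, y, d, N, rng=None):
        """Delete-d cross-validated MSE for a model linear in parameters.

        N validation subsets s of size d are drawn at random; for each, the
        model is refit by OLS on the remaining observations and scored on s.
        """
        rng = np.random.default_rng(rng)
        n = len(y)
        total = 0.0
        for _ in range(N):
            s = rng.choice(n, size=d, replace=False)       # validation subset
            keep = np.setdiff1d(np.arange(n), s)           # estimation sample
            coef, *_ = np.linalg.lstsq(Z[keep], y[keep], rcond=None)
            err = y[s] - Z[s] @ coef
            total += np.sum(err ** 2)
        return total / (d * N)

With d = 1 and S taken to be all n singletons, this reduces to the delete-1 criterion above.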
Shao (1993, 1997) analyzes the model selection performance of these cross-validation measures and relates their performance to that of the other well-known model selection procedures, in a context that accommodates cross-section but not time-series data. Shao (1993, 1997) gives general conditions establishing that given model selection procedures are either "consistent" or "asymptotically loss efficient". A consistent procedure is one that selects the best q-term (now q = q_n) approximation with probability approaching one as n increases. An asymptotically loss efficient procedure is one that selects a model such that the ratio of the sample mean squared error of the selected q-term model to that of the truly best q-term model approaches one in probability. Consistency of selection is a stronger property than asymptotic loss efficiency.

The performance of the various procedures depends crucially on whether the model is misspecified (Shao's "Class 1") or correctly specified (Shao's "Class 2"). Given our focus on misspecified models, Class 1 is the case directly relevant here, but the comparison with performance under Class 2 is nevertheless of interest. Put succinctly, Shao (1997) shows that for Class 1, under general conditions, CVMSE^{(1)} is consistent for model selection, as is CVMSE^{(d)}, provided d/n → 0 [Shao (1997, Theorem 4; see also p. 234)]. These methods behave asymptotically equivalently to AIC, GCV, and Mallows' C_p. Further, for Class 1, CVMSE^{(d)} is asymptotically loss efficient given d/n → 1 and q/(n − d) → 0 [Shao (1997, Theorem 5)]. With these weaker conditions on d, CVMSE^{(d)} behaves asymptotically equivalently to BIC.

In contrast, for Class 2 (correctly specified models) in which the correct specification is not unique (e.g., there are terms whose optimal coefficients are zero), under Shao's conditions, CVMSE^{(1)} and its equivalents (AIC, GCV, C_p) are asymptotically loss efficient but not consistent, as they tend to select more terms than necessary. CVMSE^{(d)}, however, is consistent provided d/n → 1 and q/(n − d) → 0, as is BIC [Shao (1997, Theorem 5)]. The interested reader is referred to Shao (1993, 1997) and to the discussion following Shao (1997) for details and additional guidance and insight. Given these properties, it may be useful as a practical procedure in cross-section applications to compute CVMSE^{(d)} for a substantial range of values of d, to identify an interval of values of d for which the selected model is relatively stable, and to use that model for forecasting purposes.

In cross-section applications, the subsets of observations s used for cross-validation can be populated by selecting observations at random from the estimation data. In time series applications, however, adjacent observations are typically stochastically dependent, so random selection of observations is no longer appropriate. Instead, cross-validation observations should be obtained by removing blocks of contiguous observations, in order to preserve the dependence structure of the data. A straightforward analog of CVMSE^{(d)} is "h-block" cross-validation [Burman, Chow and Nolan (1994)], whose objective function CVMSE_h can be expressed as

\[
\mathrm{CVMSE}_h(q) = \frac{1}{n}\sum_{t=1}^{n} \hat{\varepsilon}^2_{qt(-t:h)},
\]

where ε̂_{qt(−t:h)} is the prediction error for observation t computed using estimators α̂_{0(−t:h)} and β̂_{qj(−t:h)}, j = 1, …, q, obtained by omitting a block of h observations on either side of observation t from the estimation sample, that is,

\[
\hat{\varepsilon}_{qt(-t:h)} = Y_t - X_t'\hat{\alpha}_{0(-t:h)} - \sum_{j=1}^{q} \psi(X_t'\hat{\gamma}_j)\,\hat{\beta}_{qj(-t:h)}.
\]

Racine (2000) shows that with data dependence typical of economic time series, CVMSE_h is inconsistent for model selection in the sense of Shao (1993, 1997). An important contributor to this inconsistency, not present in the framework of Shao (1993, 1997), is the dependence between the observations of the omitted blocks and the remaining observations. As an alternative, Racine (2000) introduces a provably consistent model selection method for Shao's Class 2 (correctly specified) case that he calls "hv-block" cross-validation.
In this method, for given t one removes v "validation" observations on either side of that observation (a block of n_v = 2v + 1 observations) and computes the mean squared error for this validation block using estimates obtained from a sample that omits not only the validation block, but also an additional block of h observations on either side of the validation block. Estimation for a given t is thus performed on a set of n_e = n − 2h − 2v − 1 observations. (The size of the estimation set is somewhat different for t near 1 or near n.) One obtains CVMSE_hv by averaging the CVMSE for each validation block over all n − 2v available validation blocks, indexed by t = v + 1, …, n − v. With a suitable choice of h [e.g., h = int(n^{1/4}), as suggested by Racine (2000)], this approach can be proven to induce sufficient independence between the validation block and the remaining observations to ensure consistent variable selection. Although Racine (2000) finds that h = int(n^{1/4}) appears to work well in practice, the practical choice of h is still an interesting area warranting further research.

Mathematically, we can represent CVMSE_hv as

\[
\mathrm{CVMSE}_{hv}(q) = \frac{1}{n-2v}\sum_{t=v+1}^{n-v} \frac{1}{n_v}\sum_{\tau=t-v}^{t+v} \hat{\varepsilon}^2_{q\tau(-t:h,v)}.
\]

(Note that a typo appears in Racine's article; the first summation above must begin at v + 1, not v.) Here ε̂_{qτ(−t:h,v)} is the prediction error for observation τ computed using estimators α̂_{0(−t:h,v)} and β̂_{qj(−t:h,v)}, j = 1, …, q, obtained by omitting a block of h + v observations on either side of observation t from the estimation sample, that is,

\[
\hat{\varepsilon}_{q\tau(-t:h,v)} = Y_\tau - X_\tau'\hat{\alpha}_{0(-t:h,v)} - \sum_{j=1}^{q} \psi(X_\tau'\hat{\gamma}_j)\,\hat{\beta}_{qj(-t:h,v)}.
\]

Racine shows that CVMSE_hv leads to consistent variable selection for Shao's Class 2 case by taking h to be sufficiently large (controlling dependence) and taking

\[
v = \frac{n - \mathrm{int}(n^{\delta}) - 2h - 1}{2},
\]

where int(n^δ) denotes the integer part of n^δ, and δ is chosen such that ln(q̄)/ln(n) < δ < 1. In some simulations, Racine observes good performance taking h = int(n^γ) with γ = 0.25 and δ = 0.5. Observe that, analogous to the requirement d/n → 1 in Shao's Class 2 case, Racine's choice leads to 2v/n → 1.

Although Racine does not provide results for Shao's Class 1 (misspecified) case, it is quite plausible that for Class 1, asymptotic loss efficiency holds with the behavior for h and v as specified above, and that consistency of selection holds with h as above and with v/n → 0, parallel to Shao's requirements for Class 1. In any case, the performance of Racine's hv-block cross-validation generally, and in QuickNet in particular, is an appealing topic for further investigation. Some evidence on this point emerges in our examples of Section 7.

Although hv-block cross-validation appears conceptually straightforward, one may have concerns about the computational effort involved, in that, as just described, on the order of n² calculations are required. Nevertheless, as Racine (1997) shows, there are computational shortcuts for block cross-validation of linear models that make this exercise quite feasible, reducing the computations to order nh², a very considerable savings. (In fact, this can be further reduced to order n.) For models nonlinear in the parameters the same shortcuts are not available, so not only are the required computations of order n², but the computational challenges posed by nonconvexities and nonconvergence are further exacerbated by a factor of approximately n.
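For the linear-in-parameters case relevant to QuickNet Steps 0–2, a direct order-n-refits sketch of CVMSE_hv (ours; it does not use the computational shortcuts of Racine (1997), and Z again collects X_t and the fixed hidden-unit activations):

    import numpy as np

    def cvmse_hv_block(Z, y, h, v):
        """hv-block cross-validated MSE [Racine (2000)] for a model linear in
        parameters.  For each t, a validation block of n_v = 2v + 1 contiguous
        observations centered at t is scored with OLS estimates computed after
        removing that block plus h further observations on each side."""
        n = len(y)
        block_means = []
        for t in range(v, n - v):                     # t = v+1, ..., n-v in 1-based indexing
            val = np.arange(t - v, t + v + 1)         # validation block
            omit = np.arange(max(0, t - v - h), min(n, t + v + h + 1))
            keep = np.setdiff1d(np.arange(n), omit)   # estimation sample
            coef, *_ = np.linalg.lstsq(Z[keep], y[keep], rcond=None)
            err = y[val] - Z[val] @ coef
            block_means.append(np.mean(err ** 2))
        return float(np.mean(block_means))

Setting v = 0 recovers h-block cross-validation; near the sample endpoints the omitted block is simply truncated, consistent with the remark above that the estimation set is somewhat different for t near 1 or near n.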
This provides another very strong motivation for working with models linear in the parameters. We comment further on the challenges posed by models nonlinear in the parameters when we discuss our empirical examples in Section 7.

The results described in this section are asymptotic results. For example, for Shao's results, q = q_n may depend explicitly on n, with q_n → ∞, provided q_n/(n − d) → 0. In our discussion of previous sections, we have taken q ≤ q̄ < ∞, but this has been simply for convenience. Letting q̄ = q̄_n be such that q̄_n → ∞, with suitable restrictions on the rate at which q̄_n diverges, one can obtain formal results describing the asymptotic behavior of the resulting nonparametric estimators via the method of sieves. The interested reader is referred to Chen (2005) for an extensive survey of sieve methods.

Before concluding this section, we briefly discuss some potentially useful variants of the prototype algorithm specified above. One obvious possibility is to use CVMSE_hv to select the linear predictors in Step 0, and then to select more than one hidden unit term.