$$= E\bigl[(Y_t - \mu(X_t))^2\bigr] + E\bigl[(\mu(X_t) - X_t'\beta)^2\bigr].$$

The final equality follows from the fact that for all $\beta$,

$$E\bigl[(Y_t - \mu(X_t))(\mu(X_t) - X_t'\beta)\bigr] = E\bigl[E\bigl[(Y_t - \mu(X_t))(\mu(X_t) - X_t'\beta) \mid X_t\bigr]\bigr] = E\bigl[E\bigl[(Y_t - \mu(X_t)) \mid X_t\bigr]\,(\mu(X_t) - X_t'\beta)\bigr] = 0,$$

because $E[(Y_t - \mu(X_t)) \mid X_t] = 0$. Thus,

$$E\bigl[(Y_t - X_t'\beta)^2\bigr] = E\bigl[(Y_t - \mu(X_t))^2\bigr] + E\bigl[(\mu(X_t) - X_t'\beta)^2\bigr] = \sigma_*^2 + \int \bigl(\mu(x) - x'\beta\bigr)^2\,dH(x), \tag{3}$$

where $dH$ denotes the joint density of $X_t$ and $\sigma_*^2$ denotes the "pure PMSE", $\sigma_*^2 \equiv E[(Y_t - \mu(X_t))^2]$.

From (3) we see that the PMSE can be decomposed into two components: the pure PMSE $\sigma_*^2$, associated with the best possible prediction (that based on $\mu$), and the approximation mean squared error (AMSE), $\int (\mu(x) - x'\beta)^2\,dH(x)$, for $x'\beta$ as an approximation to $\mu(x)$. The AMSE is weighted by $dH$, the joint density of $X_t$, so that the squared approximation error is weighted more heavily in regions where $X_t$ is likely to be observed and less heavily in regions where $X_t$ is less likely to be observed. This weighting forces the optimal approximation to be better in more frequently observed regions of the distribution of $X_t$, at the cost of being less accurate in less frequently observed regions.

It follows that to minimize PMSE it is necessary and sufficient to minimize AMSE. That is, because $\beta^*$ minimizes PMSE, it also satisfies

$$\beta^* = \arg\min_{\beta \in \mathbb{R}^k} \int \bigl(\mu(x) - x'\beta\bigr)^2\,dH(x).$$

This shows that $\beta^*$ is the vector delivering the best possible approximation of the form $x'\beta$ to the PMSE-best predictor $\mu(x)$ of $Y_t$ given $X_t = x$, where the approximation is best in the sense of AMSE. For brevity, we refer to this as the "optimal approximation property". Note that AMSE is nonnegative. It is minimized at zero if and only if for some $\beta^o$, $\mu(x) = x'\beta^o$ (a.s.-$H$), that is, if and only if $\mathcal{L}$ is correctly specified. In this case, $\beta^* = \beta^o$.

An especially convenient property of $\beta^*$ is that it can be represented in closed form. The first order conditions for $\beta^*$ from problem (2) can be written as

$$E(X_t X_t')\,\beta^* - E(X_t Y_t) = 0.$$

Define $M \equiv E(X_t X_t')$ and $L \equiv E(X_t Y_t)$. If $M$ is nonsingular, then we can solve for $\beta^*$ to obtain the desired closed form expression

$$\beta^* = M^{-1}L.$$

The optimal point forecast based on the linear model $\mathcal{L}$ given predictors $X_t$ is then given simply by

$$Y_t^* = l(X_t, \beta^*) = X_t'\beta^*.$$

In forecasting applications we typically have a sample of data that we view as representative of the underlying population distribution generating the data (the joint distribution of $Y_t$ and $X_t$), but the population distribution is itself unknown. Typically, we do not even know the expectations $M$ and $L$ required to compute $\beta^*$, so the optimal point forecast $Y_t^*$ is also unknown. Nevertheless, we can obtain a computationally convenient estimator of $\beta^*$ from the sample data using the "plug-in principle". That is, we replace the unknown $M$ and $L$ by the sample analogs

$$\hat{M} \equiv \frac{1}{n}\sum_{t=1}^{n} X_t X_t' = X'X/n \quad \text{and} \quad \hat{L} \equiv \frac{1}{n}\sum_{t=1}^{n} X_t Y_t = X'Y/n,$$

where $X$ is the $n \times k$ matrix with rows $X_t'$, $Y$ is the $n \times 1$ vector with elements $Y_t$, and $n$ is the number of sample observations available for estimation. This yields the estimator

$$\hat{\beta} \equiv \hat{M}^{-1}\hat{L},$$

which we immediately recognize to be the ordinary least squares (OLS) estimator.
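As a concrete illustration of the plug-in computation, the following minimal sketch (in Python with NumPy; the simulated data-generating process and all numerical values are hypothetical, chosen only for illustration) forms the sample analogs $\hat{M} = X'X/n$ and $\hat{L} = X'Y/n$, solves for $\hat\beta = \hat{M}^{-1}\hat{L}$, and then uses $\hat\beta$ to produce a point forecast for a new predictor vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated sample (hypothetical DGP): k = 3 predictors, the first a constant.
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # n x k matrix with rows X_t'
mu = 1.0 + 0.5 * X[:, 1] - 0.3 * X[:, 2]                     # conditional mean mu(X_t)
Y = mu + rng.normal(scale=0.5, size=n)                       # Y_t = mu(X_t) + noise

# Plug-in (sample analog) estimates of M = E(X_t X_t') and L = E(X_t Y_t).
M_hat = X.T @ X / n
L_hat = X.T @ Y / n

# beta_hat = M_hat^{-1} L_hat, i.e., the OLS estimator.
beta_hat = np.linalg.solve(M_hat, L_hat)

# Out-of-sample point forecast for a new predictor vector X_{n+1}.
X_new = np.array([1.0, 0.2, -1.0])
Y_forecast = X_new @ beta_hat
print(beta_hat, Y_forecast)
```

Solving the linear system $\hat{M}\hat\beta = \hat{L}$ rather than explicitly inverting $\hat{M}$ is numerically preferable but yields the same $\hat\beta$; the result coincides with the usual OLS estimator (e.g., as returned by `np.linalg.lstsq(X, Y)`).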
To keep the scope of our discussion tightly focused on the more practical aspects of the subject at hand, we shall not pay close attention to the technical conditions underlying the statistical properties of $\hat\beta$ or the other estimators we discuss, and we will not state formal theorems here. Nevertheless, any claimed properties of the methods discussed here can be established under mild regularity conditions relevant for practical applications. In particular, under conditions ensuring that the law of large numbers holds (i.e., $\hat{M} \to M$ a.s. and $\hat{L} \to L$ a.s.), it follows that as $n \to \infty$, $\hat\beta \to \beta^*$ a.s.; that is, $\hat\beta$ consistently estimates $\beta^*$. Asymptotic normality can also be straightforwardly established for $\hat\beta$ under conditions sufficient to ensure the applicability of a suitable central limit theorem. [See White (2001, Chapters 2–5) for treatment of these issues.]

For clarity and notational simplicity, we operate throughout with the implicit understanding that the underlying regularity conditions ensure that our data are generated by an essentially stationary process with suitably controlled dependence. For cross-section or panel data, it suffices that the observations are independent and identically distributed (i.i.d.). In time series applications, stationarity is compatible with considerable dependence, so we implicitly permit only as much dependence as is compatible with the availability of suitable asymptotic distribution theory. Our discussion thus applies straightforwardly to unit root time-series processes after first differencing or other suitable transformations, such as those relevant for cointegrated processes. For simplicity, we leave explicit discussion of these cases aside here. Relaxing the implicit stationarity assumption to accommodate heterogeneity in the data generating process is straightforward, but the notation necessary to handle this relaxation is more cumbersome than is justified here.

Returning to our main focus, we can now define the point forecast based on the linear model $\mathcal{L}$ using $\hat\beta$ for an out-of-sample predictor vector, say $X_{n+1}$. This is computed simply as

$$\hat{Y}_{n+1} = X_{n+1}'\hat\beta.$$

We emphasize "out-of-sample" to stress that in applications, forecasts are usually constructed from predictors $X_{n+1}$ not in the estimation sample, as the associated target variable ($Y_{n+1}$) is not available until after $X_{n+1}$ is observed, as we discussed at the outset. The point of the forecasting exercise is to reduce our uncertainty about the as yet unavailable $Y_{n+1}$.

2.2. Nonlinearity

A nonlinear parametric model is generated from a nonlinear parameterization. For this, let $\ell$ be a finite integer and let the parameter space $\Theta$ be a subset of $\mathbb{R}^\ell$. Let $f$ be a function mapping $\mathbb{R}^k \times \Theta$ into $\mathbb{R}$. This generates the parametric model

$$\mathcal{N} \equiv \bigl\{ m : \mathbb{R}^k \to \mathbb{R} \mid m(x) = f(x, \theta),\ \theta \in \Theta \bigr\}.$$

The parameterization $f$ (equivalently, the parametric model $\mathcal{N}$) can be nonlinear in the predictors only, nonlinear in the parameters only, or nonlinear in both. Models that are nonlinear in the predictors are of particular interest here, so for convenience we call the forecasts arising from such models "nonlinear forecasts". For now, we keep the discussion at a general level and later pay more particular attention to the special cases.

Completely parallel to our discussion of linear models, solving problem (1) with $\mathcal{M} = \mathcal{N}$, that is, solving

$$\min_{m \in \mathcal{N}} E\bigl[(Y_t - m(X_t))^2\bigr],$$

yields the optimal forecasting function $f(\cdot, \theta^*)$, where

$$\theta^* = \arg\min_{\theta \in \Theta} E\bigl[(Y_t - f(X_t, \theta))^2\bigr]. \tag{4}$$
Here $\theta^*$ is the PMSE-optimal coefficient vector. This delivers not only the best forecast for $Y_t$ given $X_t$ based on the nonlinear model $\mathcal{N}$, but also the optimal nonlinear approximation to $\mu$ [see, e.g., White (1981)]. Now we have

$$\theta^* = \arg\min_{\theta \in \Theta} \int \bigl(\mu(x) - f(x, \theta)\bigr)^2\,dH(x).$$

The demonstration is completely parallel to that for $\beta^*$, simply replacing $x'\beta$ with $f(x, \theta)$. Now $\theta^*$ is the vector delivering the best possible approximation of the form $f(x, \theta)$ to the PMSE-best predictor $\mu(x)$ of $Y_t$ given $X_t = x$, where, as before, the approximation is best in the sense of AMSE, with the weight again given by $dH$, the density of the $X_t$'s.

The optimal point forecast based on the nonlinear model $\mathcal{N}$ given predictors $X_t$ is thus given explicitly by

$$Y_t^* = f(X_t, \theta^*).$$

The advantage of using a nonlinear model $\mathcal{N}$ is that nonlinearity in the predictors can afford greater flexibility and thus, in principle, greater forecast accuracy. Provided the nonlinear model nests the linear model (i.e., $\mathcal{L} \subset \mathcal{N}$), it follows that

$$\min_{m \in \mathcal{N}} E\bigl[(Y_t - m(X_t))^2\bigr] \le \min_{m \in \mathcal{L}} E\bigl[(Y_t - m(X_t))^2\bigr],$$

that is, the best PMSE for the nonlinear model is always at least as good as the best PMSE for the linear model. (The same relation also necessarily holds for AMSE.) A simple means of ensuring that $\mathcal{N}$ nests $\mathcal{L}$ is to include a linear component in $f$, for example, by specifying $f(x, \theta) = x'\alpha + g(x, \beta)$, where $g$ is some function nonlinear in the predictors.

Against the advantage of theoretically better forecast accuracy, using a nonlinear model has a number of potentially serious disadvantages relative to linear models: (1) the associated estimators can be much more difficult to compute; (2) nonlinear models can easily overfit the sample data, leading to inferior performance in practice; and (3) the resulting forecasts may appear more difficult to interpret. It follows that the more appealing nonlinear methods will be those that retain the advantage of flexibility but mitigate or eliminate these disadvantages relative to linear models. We now discuss considerations involved in constructing forecasts with these properties.

3. Linear, nonlinear, and highly nonlinear approximation

When a parameterization is nonlinear in the parameters, there generally does not exist a closed form expression for the PMSE-optimal coefficient vector $\theta^*$. One can nevertheless apply the plug-in principle in such cases to construct a potentially useful estimator $\hat\theta$ by solving the sample analog of the optimization problem (4) defining $\theta^*$, which yields

$$\hat\theta \equiv \arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{t=1}^{n} \bigl(Y_t - f(X_t, \theta)\bigr)^2.$$

The point forecast based on the nonlinear model $\mathcal{N}$ using $\hat\theta$ for an out-of-sample predictor vector $X_{n+1}$ is computed simply as

$$\hat{Y}_{n+1} = f(X_{n+1}, \hat\theta).$$

The challenge posed by attempting to use $\hat\theta$ is that its computation generally requires an iterative algorithm that may require considerable fine-tuning and that may or may not behave well, in that the algorithm may or may not converge, and, even with considerable effort, the algorithm may well converge to a local optimum instead of to the desired global optimum. These are the computational difficulties alluded to above.
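To make these computational difficulties concrete, the sketch below (Python with NumPy and SciPy; the exponential parameterization, the simulated data, and the choice of optimizer are purely illustrative assumptions, not prescriptions from this chapter) computes $\hat\theta$ by iterative minimization of the sample average squared error, trying several starting values and keeping the best solution precisely because any single run may stall at a local optimum.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Hypothetical nonlinear parameterization: f(x, theta) = theta0 + theta1 * exp(theta2 * x).
def f(x, theta):
    return theta[0] + theta[1] * np.exp(theta[2] * x)

# Simulated data from this parameterization plus noise (illustrative only).
n = 300
x = rng.uniform(-2, 2, size=n)
y = f(x, np.array([0.5, 1.0, -0.8])) + rng.normal(scale=0.3, size=n)

# Sample analog of problem (4): average squared prediction error.
def sse(theta):
    return np.mean((y - f(x, theta)) ** 2)

# Iterative minimization from several starting values; keep the best solution,
# since any single run may converge to a local optimum (or fail to converge).
best = None
for start in ([0.0, 0.1, 0.1], [1.0, -1.0, 0.5], [0.0, 2.0, -2.0]):
    res = minimize(sse, x0=np.array(start), method="BFGS")
    if best is None or res.fun < best.fun:
        best = res

theta_hat = best.x
x_new = 0.7                      # out-of-sample predictor value
print(theta_hat, f(x_new, theta_hat))
```

In practice, the sensitivity of the reported $\hat\theta$ to the set of starting values is itself a useful diagnostic for whether the global optimum has plausibly been found.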
As the advantage of flexibility arises entirely from nonlinearity in the predictors and the computational challenges arise entirely from nonlinearity in the parameters, it makes sense to restrict attention to parameterizations that are "series functions" of the form

$$f(x, \theta) = x'\alpha + \sum_{j=1}^{q} \psi_j(x)\beta_j, \tag{5}$$

where $q$ is some finite integer and the "basis functions" $\psi_j$ are nonlinear functions of $x$. This provides a parameterization nonlinear in $x$ but linear in the parameters $\theta \equiv (\alpha', \beta')'$, $\beta \equiv (\beta_1, \ldots, \beta_q)'$, thus delivering flexibility while simultaneously eliminating the computational challenges arising from nonlinearity in the parameters. The method of OLS can now deliver the desired sample estimator $\hat\theta$ for $\theta^*$.

Restricting attention to parameterizations having the form (5) thus reduces the problem of choosing a forecasting model to the problem of jointly choosing the basis functions $\psi_j$ and their number, $q$. With the problem framed in this way, an important next question is, "What choices of basis functions are available, and when should one prefer one choice to another?" There is a vast range of possible choices of basis functions; below we mention some of the leading possibilities. Choosing among these depends not only on the properties of the basis functions, but also on one's prior knowledge about $\mu$ and one's empirical knowledge about $\mu$, that is, the data.

Certain broad requirements help narrow the field. First, given that our objective is to obtain as good an approximation to $\mu$ as possible, a necessary property for any choice of basis functions is that this choice should yield an increasingly better approximation to $\mu$ as $q$ increases. Formally, this is the requirement that the span (the set of all linear combinations) of the basis functions $\{\psi_j,\ j = 1, 2, \ldots\}$ should be dense in the function space inhabited by $\mu$. Here, this space is $\mathcal{M} \equiv L_2(\mathbb{R}^{k-1}, dH)$, the separable Hilbert space of functions $m$ on $\mathbb{R}^{k-1}$ for which $\int m(x)^2\,dH(x)$ is finite. (Recall that $x$ contains the constant unity, so there are only $k - 1$ variables.) Second, given that we are fundamentally constrained by the amount of data available, it is also necessary that the basis functions should deliver a good approximation using as small a value of $q$ as possible.

Although the denseness requirement narrows the field somewhat, there is still an overwhelming variety of choices for $\{\psi_j\}$ with this property. Familiar examples are algebraic polynomials in $x$ of degree dependent on $j$, and in particular the related special polynomials, such as Bernstein, Chebyshev, or Hermite polynomials; and trigonometric polynomials in $x$, that is, sines and cosines of linear combinations of $x$ corresponding to pre-specified (multi-)frequencies, delivering Fourier series. Further, one can combine different families, as in Gallant's (1981) flexible Fourier form, which includes polynomials of first and second order together with sine and cosine terms for a range of frequencies.

Important and powerful extensions of the algebraic polynomials are the classes of piecewise polynomials and splines [e.g., Wahba and Wold (1975), Wahba (1990)]. Well-known types of splines are linear splines, cubic splines, and B-splines. The basis functions in the examples given so far are either orthogonal or can be made so with straightforward modifications. Orthogonality is not a necessary requirement, however.
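Because a series function of the form (5) is linear in its parameters, estimation reduces to OLS on an augmented regressor matrix. The sketch below (Python; a single nonconstant predictor, a hypothetical data-generating process, and a basis chosen loosely in the spirit of Gallant's flexible Fourier form, all purely for illustration) builds polynomial and trigonometric basis functions and fits the coefficients by least squares.

```python
import numpy as np

rng = np.random.default_rng(2)

# One nonconstant predictor; the target depends on it nonlinearly (hypothetical DGP).
n = 400
x = rng.uniform(0, 2 * np.pi, size=n)
y = np.sin(x) + 0.3 * x**2 + rng.normal(scale=0.4, size=n)

def series_design(x, q_freq=3):
    """Regressors for f(x, theta) = x'alpha + sum_j psi_j(x) beta_j:
    constant and linear term, plus quadratic and trigonometric basis functions."""
    cols = [np.ones_like(x), x, x**2]
    for j in range(1, q_freq + 1):
        cols.append(np.sin(j * x))
        cols.append(np.cos(j * x))
    return np.column_stack(cols)

Z = series_design(x)
theta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)   # OLS on the basis expansion

# Out-of-sample forecast at a new predictor value.
x_new = np.array([1.5])
y_hat = series_design(x_new) @ theta_hat
print(theta_hat.round(3), y_hat)
```

The substantive modeling decision here is the choice of basis terms and their number $q$; the fitting step itself raises no computational issues beyond those of linear regression.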
A particularly powerful class of basis functions that need not be orthogonal is the class of "wavelets", introduced by Daubechies (1988, 1992). These have the form $\psi_j(x) = \Psi(A_j(x))$, where $\Psi$ is a "mother wavelet", a given function satisfying certain specific conditions, and $A_j(x)$ is an affine function of $x$ that shifts and rescales $x$ according to a specified dyadic schedule analogous to the frequencies of Fourier analysis. For a treatment of wavelets from an economics perspective, see Gencay, Selchuk and Whitcher (2001).

Recall that a vector space is linear if (among other things) for any two elements $f$ and $g$ of the space, all linear combinations $af + bg$ also belong to the space, where $a$ and $b$ are any real numbers. All of the basis functions mentioned so far define spaces of functions

$$g_q(x, \beta) \equiv \sum_{j=1}^{q} \psi_j(x)\beta_j$$

that are linear in this sense, as taking a linear combination of two elements of this space gives

$$a\sum_{j=1}^{q} \psi_j(x)\beta_j + b\sum_{j=1}^{q} \psi_j(x)\gamma_j = \sum_{j=1}^{q} \psi_j(x)\,[a\beta_j + b\gamma_j],$$

which is again a linear combination of the first $q$ of the $\psi_j$'s.

Significantly, the second requirement mentioned above, namely that the basis should deliver a good approximation using as small a value of $q$ as possible, suggests that we might obtain a better approximation by not restricting ourselves to the functions $g_q(x, \beta)$, which force the inclusion of the $\psi_j$'s in a strict order (e.g., zero order polynomials first, followed by first order polynomials, followed by second order polynomials, and so on), but instead considering functions of the form

$$g_\Lambda(x, \beta) \equiv \sum_{j \in \Lambda} \psi_j(x)\beta_j,$$

where $\Lambda$ is a set of natural numbers ("indexes") containing at most $q$ elements, not necessarily the integers $1, \ldots, q$. The functions $g_\Lambda$ are more flexible than the functions $g_q$, in that $g_\Lambda$ admits $g_q$ as a special case. The key idea is that by suitably choosing which basis functions to use in any given instance, one may obtain a better approximation for a given number of terms $q$.

The functions $g_\Lambda$ define a nonlinear space of functions, in that linear combinations of the form $ag_\Lambda + bg_K$, where $K$ also has $q$ elements, generally have up to $2q$ terms and are therefore not contained in the space of $q$-term linear combinations of the $\psi_j$'s. Consequently, functions of the form $g_\Lambda$ are called nonlinear approximations in the approximation theory literature. Note that the nonlinearity referred to here is the nonlinearity of the function spaces defined by the functions $g_\Lambda$. For given $\Lambda$, these functions are still linear in the parameters $\beta_j$, which preserves their appeal for us here.

Recent developments in the approximation theory literature have provided considerable insight into the question of which functions are better approximated using linear approximation (functions of the form $g_q$) and which are better approximated using nonlinear approximation (functions of the form $g_\Lambda$). The survey of DeVore (1998) is especially comprehensive and deep, providing a rich catalog of results permitting a comparison of these approaches. Given sufficient a priori knowledge about the function of interest, $\mu$, DeVore's results may help one decide which approach to take.

To gain some of the flavor of the issues and results treated by DeVore (1998) that are relevant in the present context, consider the following approximation root mean squared errors:

$$\sigma_q(\mu, \psi) \equiv \inf_{\beta} \Bigl[\int \bigl(\mu(x) - g_q(x, \beta)\bigr)^2\,dH(x)\Bigr]^{1/2},$$
$$\sigma_\Lambda(\mu, \psi) \equiv \inf_{\Lambda, \beta} \Bigl[\int \bigl(\mu(x) - g_\Lambda(x, \beta)\bigr)^2\,dH(x)\Bigr]^{1/2}.$$
These are, for linear and nonlinear approximation respectively, the best possible approximation root mean squared errors (RMSEs) using $q$ of the $\psi_j$'s. (For simplicity, we are ignoring the linear term $x'\alpha$ previously made explicit; alternatively, imagine we have absorbed it into $\mu$.) DeVore devotes primary attention to one of the central issues of approximation theory, the "degree of approximation" question: "Given a positive real number $a$, for what functions $\mu$ does the degree of approximation (as measured here by the above approximation RMSEs) behave as $O(q^{-a})$?" Clearly, the larger is $a$, the more quickly the approximation improves with $q$.

In general, the answer to the degree of approximation question depends on the smoothness and dimensionality ($k - 1$) of $\mu$, quantified in precisely the right ways. For linear approximation, the smoothness conditions typically involve the existence of a number of derivatives of $\mu$ and the finiteness of their moments (e.g., second moments), such that more smoothness and smaller dimensionality yield quicker approximation. The answer also depends on the particular choice of the $\psi_j$'s; suffice it to say that the details can be quite involved.

In the nonlinear case, familiar notions of smoothness in terms of derivatives generally no longer provide the necessary guidance. To describe the smoothness notion relevant in this context, suppose for simplicity that $\{\psi_j\}$ forms an orthonormal basis for the Hilbert space in which $\mu$ lives. Then the optimal coefficients $\beta_j^*$ are given by

$$\beta_j^* = \int \psi_j(x)\mu(x)\,dH(x).$$

As DeVore (1998, p. 135) states, "smoothness for [nonlinear] approximation should be viewed as decay of the coefficients with respect to the basis [i.e., the $\beta_j^*$'s]" (emphasis added). In particular, let $\tau = 1/(a + 1/2)$. Then, according to DeVore (1998, Theorem 4), $\sigma_\Lambda(\mu, \psi) = O(q^{-a})$ if and only if there exists a finite constant $M$ such that

$$\#\bigl\{j : |\beta_j^*| > z\bigr\} \le M^{\tau} z^{-\tau}.$$

For example, $\sigma_\Lambda(\mu, \psi) = O(q^{-1/2})$ if for some $M$ we have $\#\{j : |\beta_j^*| > z\} \le Mz^{-1}$.

An important and striking aspect of this view of smoothness is that it is relative to the basis. A function that is not at all smooth with respect to one basis may be quite smooth with respect to another. Another striking feature of results of this sort is that the dimensionality of $\mu$ no longer plays an explicit role, seemingly suggesting that nonlinear approximation may somehow hold in abeyance the "curse of dimensionality" (the inability to approximate functions well in high-dimensional spaces without inordinate amounts of data). A more precise interpretation of this situation seems to be that smoothness with respect to the basis also incorporates dimensionality, such that a given decay rate for the optimal coefficients is a stronger condition in higher dimensions.
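A small numerical sketch may help fix the distinction between $g_q$ (keep the first $q$ terms) and $g_\Lambda$ (keep the best $q$ terms). In the sketch below (Python; the cosine basis, the particular target function $\mu$, and the grid approximation of the $dH$-integrals are all illustrative assumptions of ours, with $dH$ taken uniform on $[0,1]$), we compute the coefficients $\beta_j^*$ with respect to an orthonormal basis and compare the approximation RMSE from the first $q$ coefficients with that from the $q$ largest in magnitude.

```python
import numpy as np

# Orthonormal cosine basis on [0, 1] with dH uniform (illustrative assumptions).
def psi(j, x):
    return np.ones_like(x) if j == 0 else np.sqrt(2.0) * np.cos(j * np.pi * x)

# A target mu with a sharp local feature, so that the low-order terms are not
# necessarily the most important ones and a "best q-term" choice can differ
# from the "first q terms" choice.
def mu(x):
    return np.exp(-200 * (x - 0.6) ** 2) + 0.2 * x

grid = np.linspace(0, 1, 20001)
w = 1.0 / grid.size                                   # crude quadrature weights for dH
J = 200
beta_star = np.array([np.sum(psi(j, grid) * mu(grid)) * w for j in range(J)])

def rmse(indices):
    approx = sum(beta_star[j] * psi(j, grid) for j in indices)
    return np.sqrt(np.sum((mu(grid) - approx) ** 2) * w)

q = 10
first_q = range(q)                                    # linear approximation g_q
best_q = np.argsort(-np.abs(beta_star))[:q]           # nonlinear approximation g_Lambda
print("first q terms:", rmse(first_q), " best q terms:", rmse(best_q))
```

Because the basis is orthonormal, keeping the $q$ largest coefficients is the best possible $q$-term choice, so (up to the numerical quadrature error) the second RMSE can never exceed the first; how much smaller it is reflects how rapidly the ordered coefficients decay.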
In some cases, theory alone can inform us about the choice of basis functions. For example, it turns out, as DeVore (1998, p. 106) discusses, that with respect to nonlinear approximation, rational polynomials have approximation properties essentially equivalent to those of piecewise polynomials. In this sense, there is nothing to gain or lose in selecting one of these bases over the other. In other cases, the helpfulness of the theory in choosing a basis depends on having quite specific knowledge about $\mu$, for example, that it is very smooth (in the familiar sense) in some places and very rough in others, or that it has singularities or discontinuities. For example, Dekel and Leviatan (2003) show that in this sense wavelet approximations do not perform well in capturing singularities along curves, whereas nonlinear piecewise polynomial approximations do.

Usually, however, we economists have little prior knowledge about the familiar smoothness properties of $\mu$, let alone its smoothness with respect to any given basis. As a practical matter, then, it may make sense to consider a collection of different bases and let the data guide us to the best choice. Such a collection of bases is called a library. An example is the wavelet packet library proposed by Coifman and Wickerhauser (1992).

Alternatively, one can choose the $\psi_j$'s from any suitable subset of the Hilbert space. Such a subset is called a dictionary; the idea is once again to let the data help decide which elements of the dictionary to select. Artificial neural networks (ANNs) are an example of a dictionary, generated by letting $\psi_j(x) = \Psi(x'\gamma_j)$ for a given "activation function" $\Psi$, such as the logistic cdf ($\Psi(z) = 1/(1 + \exp(-z))$), and with $\gamma_j$ any element of $\mathbb{R}^k$. For a discussion of artificial neural networks from an econometric perspective, see Kuan and White (1994). Trippi and Turban (1992) contains a collection of papers applying ANNs to economics and finance.

Approximating a function $\mu$ using a library or dictionary is called highly nonlinear approximation, as not only is there the nonlinearity associated with choosing $q$ basis functions, but there is the further choice of the basis itself or of the elements of the dictionary. Section 8 of DeVore's (1998) comprehensive survey is devoted to a discussion of the so far somewhat fragmentary degree of approximation results for approximations of this sort. Nevertheless, some powerful results are available. Specifically, for sufficiently rich dictionaries $\mathcal{D}$ (e.g., artificial neural networks as above), DeVore and Temlyakov (1996) show [see DeVore (1998, Theorem 7)] that for $a \le 1/2$ and sufficiently smooth functions $\mu$,

$$\sigma_q(\mu, \mathcal{D}) \le C_a q^{-a},$$

where $C_a$ is a constant quantifying the smoothness of $\mu$ relative to the dictionary, and, analogous to the case of nonlinear approximation, we define

$$\sigma_q(\mu, \mathcal{D}) \equiv \inf_{\Lambda, \beta} \Bigl[\int \bigl(\mu(x) - g_\Lambda(x, \beta)\bigr)^2\,dH(x)\Bigr]^{1/2}, \qquad g_\Lambda(x, \beta) \equiv \sum_{\psi_j \in \Lambda} \psi_j(x)\beta_j,$$

where $\Lambda$ is a $q$-element subset of $\mathcal{D}$. DeVore and Temlyakov's result generalizes an earlier result for $a = 1/2$ of Maurey [see Pisier (1980)]. Jones (1992) provides a "greedy algorithm" and a "relaxed greedy algorithm" achieving $a = 1/2$ for a specific dictionary and class of functions $\mu$, and DeVore (1998) discusses further related algorithms.

The cases discussed so far by no means exhaust the possibilities. Among other notable choices for the $\psi_j$'s relevant in economics are radial basis functions [Powell (1987), Lendasse et al. (2003)] and ridgelets [Candes (1998, 1999a, 1999b, 2003)]. Radial basis functions arise by taking

$$\psi_j(x) = \Psi\bigl(p_2(x, \gamma_j)\bigr),$$

where $p_2(x, \gamma_j)$ is a polynomial of (at most) degree 2 in $x$ with coefficients $\gamma_j$, and $\Psi$ is typically taken to be such that, with the indicated choice of $p_2(x, \gamma_j)$, $\Psi(p_2(x, \gamma_j))$ is proportional to a density function. Standard radial basis functions treat the $\gamma_j$'s as free parameters and restrict $p_2(x, \gamma_j)$ to have the form

$$p_2(x, \gamma_j) = -(x - \gamma_{1j})'\gamma_{2j}(x - \gamma_{1j})/2,$$

where $\gamma_j \equiv (\gamma_{1j}, \gamma_{2j})$, so that $\gamma_{1j}$ acts as a centering vector, and $\gamma_{2j}$ is a $k \times k$ symmetric positive semi-definite matrix acting to scale the departures of $x$ from $\gamma_{1j}$.
A common choice for $\Psi$ is $\Psi = \exp$, which delivers $\Psi(p_2(x, \gamma_j))$ proportional to the multivariate normal density with mean $\gamma_{1j}$ and with $\gamma_{2j}$ a suitable generalized inverse of a given covariance matrix. Thus, standard radial basis functions have the form of a linear combination of multivariate densities, accommodating a mixture of densities as a special case. Treating the $\gamma_j$'s as free parameters, we may view the radial basis functions as a dictionary, as defined above.

Candes's ridgelets can be thought of as a very carefully constructed special case of ANNs. Ridgelets arise by taking

$$\psi_j(x) = \gamma_{1j}^{-1/2}\,\Psi\bigl((\tilde{x}'\gamma_{2j} - \gamma_{0j})/\gamma_{1j}\bigr),$$

where $\tilde{x}$ denotes the vector of nonconstant elements of $x$ (i.e., $x = (1, \tilde{x}')'$), $\gamma_{0j}$ is real, $\gamma_{1j} > 0$, and $\gamma_{2j}$ belongs to $S^{k-2}$, the unit sphere in $\mathbb{R}^{k-1}$. The activation function $\Psi$ is taken to belong to the space of rapidly decreasing functions (Schwartz space, a subset of $C^\infty$) and to satisfy a specific admissibility property on its Fourier transform [see Candes (1999a, Definition 1)], essentially equivalent to the moment conditions

$$\int z^j \Psi(z)\,dz = 0, \qquad j = 0, \ldots, k/2 - 1.$$

This condition ensures that $\Psi$ oscillates, has zero average value, zero average slope, and so on. For example, $\Psi = D^h\phi$, the $h$th derivative of the standard normal density $\phi$, is readily verified to be admissible with $h = k/2$. The admissibility of the activation function has a number of concrete benefits, but the chief benefit for present purposes is that it leads to the explicit specification of a countable sequence $\{\gamma_j = (\gamma_{0j}, \gamma_{1j}, \gamma_{2j}')'\}$ such that any function $f$ square integrable on a compact set has an exact representation of the form

$$f(x) \equiv \sum_{j=1}^{\infty} \psi_j(x)\beta_j^*.$$

The representing coefficients $\beta_j^*$ are such that good approximations can be obtained using $g_q(x, \beta)$ or $g_\Lambda(x, \beta)$ as above. In this sense, the ridgelet dictionary that arises by letting the $\gamma_j$'s be free parameters (as in the usual ANN approach) can be reduced to a countable subset that delivers a basis with appealing properties.

As Candes (1999b) shows, ridgelets turn out to be optimal for representing otherwise smooth multivariate functions that may exhibit linear singularities, achieving a rate of approximation of $O(q^{-a})$ with $a = s/(k - 1)$, provided the $s$th derivatives of $f$ exist and are square integrable. This is in sharp contrast to Fourier series or wavelets, which can be badly behaved in the presence of singularities. Candes (2003) provides an extensive discussion of the properties of ridgelet regression estimators and, in particular, of certain shrinkage estimators based on thresholding coefficients from a ridgelet regression. (By thresholding is meant setting to zero estimated coefficients whose magnitude does not exceed some pre-specified value.) Candes (2003) also discusses the superiority in multivariate contexts of ridgelet methods to kernel smoothing and wavelet thresholding methods.
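To illustrate the dictionary idea in its simplest form, the sketch below (Python; the random ANN dictionary, the simulated data, and the crude forward-selection rule are all illustrative assumptions of ours, far simpler than the greedy and relaxed greedy algorithms of Jones (1992) cited above) draws a dictionary of logistic ridge functions $\Psi(x'\gamma_j)$ and greedily adds the $q$ elements most correlated with the current residual, refitting all selected coefficients by OLS at each step.

```python
import numpy as np

rng = np.random.default_rng(3)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical data: two nonconstant predictors, nonlinear conditional mean.
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])     # rows are x' = (1, x~')
y = np.tanh(X[:, 1] - X[:, 2]) + 0.5 * X[:, 1] * X[:, 2] + rng.normal(scale=0.3, size=n)

# Dictionary: candidate basis functions psi_j(x) = logistic(x'gamma_j), gamma_j drawn at random.
n_dict = 200
Gammas = rng.normal(size=(n_dict, X.shape[1]))
Psi = logistic(X @ Gammas.T)                                    # n x n_dict candidate regressors

# Greedy forward selection of q dictionary elements, always keeping a constant term.
q = 8
selected = []
residual = y - y.mean()
for _ in range(q):
    scores = np.abs(Psi.T @ residual)                           # affinity with current residual
    if selected:
        scores[selected] = -np.inf                              # do not pick an element twice
    selected.append(int(np.argmax(scores)))
    Z = np.column_stack([np.ones(n), Psi[:, selected]])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)                # refit by OLS on chosen terms
    residual = y - Z @ coef

print("selected dictionary elements:", selected)
print("in-sample RMSE:", np.sqrt(np.mean(residual ** 2)))
```

In-sample fit alone will favor ever larger $q$; the overfitting concern noted earlier is what disciplines that choice in practice.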
In DeVore's (1998) survey, Candes's papers, and the references cited there, the interested reader can find a wealth of further material describing the approximation properties of a wide variety of different choices for the $\psi_j$'s. From a practical standpoint, however, these results do not yield hard and fast prescriptions about how to choose the $\psi_j$'s, especially in the circumstances commonly faced by economists, where one may have little prior information about the smoothness of the function of interest. Nevertheless, certain helpful suggestions emerge. Specifically: (i) nonlinear approximations are an appealing alternative to linear approximations; (ii) using a library or dictionary of basis functions may prove useful; and (iii) ANNs, and ridgelets in particular, may prove useful. These suggestions are simply things to try. In any given instance, the data must be the final arbiter of how well any particular approach works. In the next section, we provide a concrete example of how these suggestions may be put into practice and how they interact with other practical concerns.