12 ICA by Nonlinear Decorrelation and Nonlinear PCA

This chapter starts by reviewing some of the early research efforts in independent component analysis (ICA), especially the technique based on nonlinear decorrelation that was successfully used by Jutten, Hérault, and Ans to solve the first ICA problems. Today, this work is mainly of historical interest, because there exist several more efficient algorithms for ICA.

Nonlinear decorrelation can be seen as an extension of second-order methods such as whitening and principal component analysis (PCA). These methods give components that are uncorrelated linear combinations of input variables, as explained in Chapter 6. We will show that independent components can in some cases be found as nonlinearly uncorrelated linear combinations. The nonlinear functions used in this approach introduce higher-order statistics into the solution method, making ICA possible. We then show how the work on nonlinear decorrelation eventually led to the Cichocki-Unbehauen algorithm, which is essentially the same as the algorithm that we derived in Chapter 9 using the natural gradient. Next, the criterion of nonlinear decorrelation is extended and formalized into the theory of estimating functions, and the closely related EASI algorithm is reviewed.

Another approach to ICA that is related to PCA is the so-called nonlinear PCA. A nonlinear representation is sought for the input data that minimizes a least mean-square error criterion. For the linear case, it was shown in Chapter 6 that principal components are obtained. It turns out that in some cases the nonlinear PCA approach gives independent components instead. We review the nonlinear PCA criterion and show its equivalence to other criteria like maximum likelihood (ML). Then, two typical learning rules introduced by the authors are reviewed, of which the first one

Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja. Copyright 2001 John Wiley & Sons, Inc.
ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)

is a stochastic gradient algorithm and the other one a recursive least mean-square algorithm.

12.1 NONLINEAR CORRELATIONS AND INDEPENDENCE

The correlation between two random variables y1 and y2 was discussed in detail in Chapter 2. Here we consider zero-mean variables only, so correlation and covariance are equal. Correlation is related to independence in such a way that independent variables are always uncorrelated. The opposite is not true, however: the variables can be uncorrelated, yet dependent. An example is a uniform density in a rotated square centered at the origin of the (y1, y2) space; see, e.g., Fig. 8.3. Both y1 and y2 are zero-mean and uncorrelated, no matter what the orientation of the square, but they are independent only if the square is aligned with the coordinate axes. In some cases uncorrelatedness does imply independence, though; the best example is the case when the density of (y1, y2) is constrained to be jointly gaussian.

Extending the concept of correlation, we here define the nonlinear correlation of the random variables y1 and y2 as E{f(y1)g(y2)}. Here, f(y1) and g(y2) are two functions, of which at least one is nonlinear. Typical examples might be polynomials of degree higher than 1, or more complex functions like the hyperbolic tangent. This means that one or both of the random variables are first transformed nonlinearly to new variables f(y1), g(y2), and then the usual linear correlation between these new variables is considered.

The question now is: Assuming that y1 and y2 are nonlinearly decorrelated in the sense

E{f(y1)g(y2)} = 0    (12.1)

can we say something about their independence? We would hope that by making this kind of nonlinear correlation zero, independence would be obtained under some additional conditions to be specified.
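To make the definition concrete, here is a small numerical sketch (ours, not from the text; the source distributions, the sample size, and the choice f(y) = y^3, g(y) = y are arbitrary illustration choices). Two linearly uncorrelated but dependent variables are built by rotating a pair of independent sources with different distributions; the nonlinear correlation exposes the dependence that ordinary correlation misses:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200_000

# Two independent, zero-mean, unit-variance sources with different
# distributions: uniform (sub-Gaussian) and Laplace (super-Gaussian).
s1 = rng.uniform(-np.sqrt(3), np.sqrt(3), T)
s2 = rng.laplace(0.0, 1.0 / np.sqrt(2), T)

# A 45-degree rotation mixes them: y1, y2 are linearly uncorrelated
# (same variances cancel) but clearly dependent.
y1 = (s1 - s2) / np.sqrt(2)
y2 = (s1 + s2) / np.sqrt(2)

linear_corr = np.mean(y1 * y2)         # E{y1 y2}: approx. 0
nonlinear_corr = np.mean(y1**3 * y2)   # E{f(y1) g(y2)}: (E{s1^4} - E{s2^4})/4,
                                       # clearly nonzero here

# For truly independent variables the nonlinear correlation factorizes,
# E{f(s1) g(s2)} = E{f(s1)} E{g(s2)}: it vanishes for odd f and a symmetric
# density, but NOT for even functions, despite independence:
even_corr = np.mean(s1**2 * s2**2)     # approx. E{s1^2} E{s2^2} = 1
```

So a zero linear correlation can hide dependence that a suitable nonlinear correlation reveals, while even nonlinearities can report a "correlation" for independent variables; this is why the choice of f and g matters, as discussed below.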
There is a general theorem (see, e.g., [129]) stating that y1 and y2 are independent if and only if

E{f(y1)g(y2)} = E{f(y1)} E{g(y2)}    (12.2)

for all continuous functions f and g that are zero outside a finite interval. Based on this, it seems very difficult to approach independence rigorously, because the functions f and g are almost arbitrary. Some kind of approximations are needed.

This problem was considered by Jutten and Hérault [228]. Let us assume that f(y1) and g(y2) are smooth functions that have derivatives of all orders in a neighborhood of the origin. They can be expanded in Taylor series:

f(y1) = f(0) + f'(0)y1 + (1/2)f''(0)y1^2 + ... = Σ_{i=0}^∞ f_i y1^i
g(y2) = g(0) + g'(0)y2 + (1/2)g''(0)y2^2 + ... = Σ_{i=0}^∞ g_i y2^i

where f_i, g_i is shorthand for the coefficients of the i-th powers in the series. The product of the functions is then

f(y1)g(y2) = Σ_{i=1}^∞ Σ_{j=1}^∞ f_i g_j y1^i y2^j    (12.3)

and condition (12.1) is equivalent to

E{f(y1)g(y2)} = Σ_{i=1}^∞ Σ_{j=1}^∞ f_i g_j E{y1^i y2^j} = 0    (12.4)

Obviously, a sufficient condition for this equation to hold is

E{y1^i y2^j} = 0    (12.5)

for all indices i, j appearing in the series expansion (12.4). There may be other solutions in which the higher-order correlations are not zero, but the coefficients f_i, g_j happen to be just suitable to cancel the terms and make the sum in (12.4) exactly equal to zero. For nonpolynomial functions that have infinite Taylor expansions, such spurious solutions can be considered unlikely (we will see later that such spurious solutions do exist, but they can be avoided by the theory of ML estimation).

Again, a sufficient condition for (12.5) to hold is that the variables y1 and y2 are independent and one of E{y1^i}, E{y2^j} is zero. Let us require that E{y1^i} = 0 for all powers i appearing in its series expansion.
But this is only possible if f(y1) is an odd function; then the Taylor series contains only odd powers 1, 3, 5, ..., and the powers i in Eq. (12.5) will also be odd. Otherwise, we would have the case that even moments of y1, like the variance, are zero, which is impossible unless y1 is constant.

To conclude, a sufficient (but not necessary) condition for the nonlinear uncorrelatedness (12.1) to hold is that y1 and y2 are independent, and for one of them, say y1, the nonlinearity is an odd function such that f(y1) has zero mean.

The preceding discussion is informal, but it should make it credible that nonlinear correlations are useful as a possible general criterion for independence. Several things have to be decided in practice. The first one is how to actually choose the functions f, g. Is there some natural optimality criterion that can tell us that some functions are better than some other ones? This will be answered in Sections 12.3 and 12.4. The second problem is how we could solve Eq. (12.1), or nonlinearly decorrelate two variables y1, y2. This is the topic of the next section.

12.2 THE HÉRAULT-JUTTEN ALGORITHM

Consider the ICA model x = As. Let us first look at a 2 × 2 case, which was considered by Hérault, Jutten, and Ans [178, 179, 226] in connection with the blind separation of two signals from two linear mixtures. The model is then

x1 = a11 s1 + a12 s2
x2 = a21 s1 + a22 s2

Hérault and Jutten proposed the feedback circuit shown in Fig. 12.1 to solve the problem. The initial outputs are fed back to the system, and the outputs are recomputed until an equilibrium is reached.

Fig. 12.1 The basic feedback circuit for the Hérault-Jutten algorithm. The element marked with + is a summation.

From Fig.
12.1 we have directly

y1 = x1 − m12 y2    (12.6)
y2 = x2 − m21 y1    (12.7)

Before inputting the mixture signals x1, x2 to the network, they were normalized to zero mean, which means that the outputs y1, y2 will also have zero means. Defining a matrix M with off-diagonal elements m12, m21 and diagonal elements equal to zero, these equations can be compactly written as y = x − My. Thus the input-output mapping of the network is

y = (I + M)^(−1) x    (12.8)

Note that from the original ICA model we have s = A^(−1) x, provided that A is invertible. If I + M = A, then y becomes equal to s. However, the problem in blind separation is that the matrix A is unknown. The solution that Jutten and Hérault introduced was to adapt the two feedback coefficients m12, m21 so that the outputs of the network y1, y2 become independent. Then the matrix A has been implicitly inverted and the original sources have been found. For independence, they used the criterion of nonlinear correlations. They proposed the following learning rules:

Δm12 = μ f(y1) g(y2)    (12.9)
Δm21 = μ f(y2) g(y1)    (12.10)

with μ the learning rate. Both functions f(.), g(.) are odd functions; typically, the functions f(y) = y^3, g(y) = arctan(y) were used, although the method also seems to work for g(y) = y or g(y) = sign(y). Now, if the learning converges, then the right-hand sides must be zero on average, implying

E{f(y1)g(y2)} = E{f(y2)g(y1)} = 0

Thus independence has hopefully been attained for the outputs y1, y2. A stability analysis for the Hérault-Jutten algorithm was presented in [408].

In the numerical computation of the matrix M according to the algorithm (12.9), (12.10), the outputs y1, y2 on the right-hand side must also be updated at each step of the iteration. By Eq. (12.8), they too depend on M, and solving them requires the inversion of the matrix I + M.
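The algorithm can be sketched in a few lines of Python/NumPy (our illustration; the uniform sources, the unit-diagonal mixing matrix, and the learning rate are assumptions, not values from the original papers). For each sample, the 2 × 2 feedback system (12.6)-(12.7) is solved exactly, and the coefficients are then updated by (12.9)-(12.10) with f(y) = y^3, g(y) = arctan(y):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 20_000

# Two independent, zero-mean, unit-variance uniform sources.
s = rng.uniform(-np.sqrt(3), np.sqrt(3), (2, T))
A = np.array([[1.0, 0.5],
              [0.6, 1.0]])      # mixing matrix with unit diagonal
x = A @ s

f, g = (lambda y: y**3), np.arctan
m12 = m21 = 0.0
mu = 0.002                      # learning rate (assumed)

for _ in range(3):              # a few passes over the data
    for t in range(T):
        # Solve the feedback equations (12.6)-(12.7) exactly:
        det = 1.0 - m12 * m21
        y1 = (x[0, t] - m12 * x[1, t]) / det
        y2 = (x[1, t] - m21 * x[0, t]) / det
        m12 += mu * f(y1) * g(y2)    # rule (12.9)
        m21 += mu * f(y2) * g(y1)    # rule (12.10)

# With I + M = A, the network output equals the sources, so m12 and m21
# should drift toward a12 = 0.5 and a21 = 0.6.
M = np.array([[0.0, m12], [m21, 0.0]])
y = np.linalg.solve(np.eye(2) + M, x)
```

The cubic/arctan pair is a reasonable choice for the sub-Gaussian sources used here; for other source densities, the stability of this equilibrium is not guaranteed, as the drawbacks discussed below indicate.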
As noted by Cichocki and Unbehauen [84], this matrix inversion may be computationally heavy, especially if the approach is extended to more than two sources and mixtures. One way to circumvent this problem is to use the rough approximation

y = (I + M)^(−1) x ≈ (I − M) x

which seems to work in practice.

Although the Hérault-Jutten algorithm was a very elegant pioneering solution to the ICA problem, we know now that it has some drawbacks in practice. The algorithm may work poorly or even fail to separate the sources altogether if the signals are badly scaled or the mixing matrix is ill-conditioned. The number of sources that the method can separate is severely limited. Also, although local stability was shown in [408], good global convergence behavior is not guaranteed.

12.3 THE CICHOCKI-UNBEHAUEN ALGORITHM

Starting from the Hérault-Jutten algorithm, Cichocki, Unbehauen, and coworkers [82, 85, 84] derived an extension that has much enhanced performance and reliability. Instead of a feedback circuit like the Hérault-Jutten network in Fig. 12.1, Cichocki and Unbehauen proposed a feedforward network with weight matrix B, with the mixture vector x as input and with output y = Bx. Now the dimensionality of the problem can be higher than 2. The goal is to adapt the m × m matrix B so that the elements of y become independent. The learning algorithm for B is as follows:

ΔB = μ [Λ − f(y) g(yᵀ)] B    (12.11)

where μ is the learning rate, Λ is a diagonal matrix whose elements determine the amplitude scaling for the elements of y (typically, Λ could be chosen as the unit matrix I), and f and g are two nonlinear scalar functions; the authors proposed a polynomial and a hyperbolic tangent. The notation f(y) means a column vector with elements f(y1), ..., f(yn). The argumentation showing that this algorithm, too, will give independent components is based on nonlinear decorrelations.
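In outline, the rule (12.11) can be simulated as follows (our sketch, with assumed choices: three uniform sources, Λ = I, f(y) = y^3, g(y) = y, and an arbitrary learning rate). If the algorithm separates, the global system matrix BA approaches a scaled permutation:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 3, 20_000
s = rng.uniform(-np.sqrt(3), np.sqrt(3), (n, T))   # independent sub-Gaussian sources
A = np.array([[1.0, 0.4, 0.2],
              [0.3, 1.0, 0.4],
              [0.1, 0.3, 1.0]])
x = A @ s

B = np.eye(n)
mu = 0.002                        # learning rate (assumed)
Lam = np.eye(n)                   # Lambda = I fixes the output scales
f = lambda y: y**3                # cubic f (assumed; suits sub-Gaussian sources)
g = lambda y: y                   # linear g

for _ in range(3):                # a few passes over the data
    for t in range(T):
        y = B @ x[:, t]
        B += mu * (Lam - np.outer(f(y), g(y))) @ B   # rule (12.11)

G = B @ A    # global system matrix; should approach a scaled permutation
```

With g linear and f a suitable odd nonlinearity, this is precisely the natural gradient setting discussed below; the diagonal of E{f(y)g(yᵀ)} settles at Λ = I, which here only fixes the variances of the outputs.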
Consider the stationary solution of this learning rule, defined as the matrix B for which E{ΔB} = 0, with the expectation taken over the density of the mixtures x. For this matrix, the update is on average zero. Because this is a stochastic-approximation-type algorithm (see Chapter 3), such stationarity is a necessary condition for convergence. Excluding the trivial solution B = 0, we must have

Λ − E{f(y) g(yᵀ)} = 0

Especially, for the off-diagonal elements, this implies

E{f(y_i) g(y_j)} = 0    (12.12)

which is exactly our definition of nonlinear decorrelation in Eq. (12.1), extended to n output signals y1, ..., yn. The diagonal elements satisfy

E{f(y_i) g(y_i)} = λ_ii

showing that the diagonal elements λ_ii of the matrix Λ only control the amplitude scaling of the outputs.

The conclusion is that if the learning rule converges to a nonzero matrix B, then the outputs of the network must become nonlinearly decorrelated, and hopefully independent. The convergence analysis has been performed in [84]; for general principles of analyzing stochastic iteration algorithms like (12.11), see Chapter 3.

The justification for the Cichocki-Unbehauen algorithm (12.11) in the original articles was based on nonlinear decorrelations, not on any rigorous cost function that would be minimized by the algorithm. However, it is interesting to note that this algorithm, first appearing in the early 1990s, is in fact the same as the popular natural gradient algorithm introduced later by Amari, Cichocki, and Yang [12] as an extension of the original Bell-Sejnowski algorithm [36]. All we have to do is choose Λ as the unit matrix, the function g(y) as the linear function g(y) = y, and the function f(y) as a sigmoidal function related to the true density of the sources. The Amari-Cichocki-Yang algorithm and the Bell-Sejnowski algorithm were reviewed in Chapter 9, where it was shown how the algorithms are derived from the rigorous maximum likelihood criterion.
The maximum likelihood approach also tells us what kind of nonlinearities should be used, as discussed in Chapter 9.

12.4 THE ESTIMATING FUNCTIONS APPROACH *

Consider the criterion of nonlinear decorrelations being zero, generalized to n random variables y1, ..., yn, as shown in Eq. (12.12). Among the possible roots y1, ..., yn of these equations are the source signals s1, ..., sn. When solving these in an algorithm like the Hérault-Jutten algorithm or the Cichocki-Unbehauen algorithm, one in fact solves for the separating matrix B. This notion was generalized and formalized by Amari and Cardoso [8] to the case of estimating functions.

Again, consider the basic ICA model x = As, s = B* x, where B* is a true separating matrix (we use this special notation here to avoid any confusion). An estimating function is a matrix-valued function F(x, B) such that

E{F(x, B*)} = 0    (12.13)

This means that, taking the expectation with respect to the density of x, the true separating matrices are roots of the equation. Once these are solved from Eq. (12.13), the independent components are directly obtained.

Example 12.1 Given a set of nonlinear functions f1(y1), ..., fn(yn), with y = Bx, and defining a vector function f(y) = [f1(y1), ..., fn(yn)]ᵀ, a suitable estimating function for ICA is

F(x, B) = f(y)yᵀ − Λ = f(Bx)(Bx)ᵀ − Λ    (12.14)

because obviously E{f(y)yᵀ} becomes diagonal when B is a true separating matrix B* and y1, ..., yn are independent and zero-mean. Then the off-diagonal elements become E{f_i(y_i) y_j} = E{f_i(y_i)} E{y_j} = 0. The diagonal matrix Λ determines the scales of the separated sources.
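A quick numerical check of (12.13)-(12.14) (our sketch; the Laplace sources, the choice f_i = tanh, and the sample size are arbitrary): at a true separating matrix B* = A^(−1), the off-diagonal elements of the sample average of f(y)yᵀ are close to zero, so F(x, B) = f(y)yᵀ − Λ indeed has B* as an approximate root:

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 3, 100_000
s = rng.laplace(0.0, 1.0 / np.sqrt(2), (n, T))   # independent zero-mean sources
A = rng.normal(0.0, 1.0, (n, n))                 # random mixing matrix
x = A @ s

B_true = np.linalg.inv(A)        # a true separating matrix B*
y = B_true @ x                   # equals s up to numerical error

fy = np.tanh(y)                  # f_i(y_i) = tanh(y_i), one possible choice
F_hat = fy @ y.T / T             # sample version of E{f(y) y^T}
Lam = np.diag(np.diag(F_hat))    # its diagonal part plays the role of Lambda

off_diag = F_hat - Lam           # should vanish at B_true as T grows
```

The off-diagonal entries shrink like O(1/sqrt(T)); the diagonal entries converge to the scale factors λ_ii = E{f_i(y_i) y_i}.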
Another estimating function is the right-hand side of the learning rule (12.11),

F(x, B) = [Λ − f(y) g(yᵀ)] B

There is a fundamental difference between the estimating function approach and most other approaches to ICA. The usual starting point in ICA is a cost function that somehow measures how independent or nongaussian the outputs y_i are, and the independent components are solved by minimizing the cost function. In contrast, there is no such cost function here. The estimating function need not be the gradient of any other function. In this sense, the theory of estimating functions is very general and potentially useful for finding ICA algorithms. For a discussion of this approach in connection with neural networks, see [328].

It is not a trivial question how to design in practice an estimating function so that we can solve the ICA model. Even if we have two estimating functions that both have been shaped in such a way that separating matrices are their roots, what is a relevant measure to compare them? Statistical considerations are helpful here. Note that in practice, the densities of the sources s_i and the mixtures x_j are unknown in the ICA model. It is thus impossible to solve Eq. (12.13) as such, because the expectation cannot be formed. Instead, it has to be estimated using a finite sample of x. Denoting this sample by x(1), ..., x(T), we use the sample function

E{F(x, B)} ≈ (1/T) Σ_{t=1}^{T} F(x(t), B)

Its root B̂ is then an estimator for the true separating matrix. Obviously (see Chapter 4), the root B̂ = B̂[x(1), ..., x(T)] is a function of the training sample, and it is meaningful to consider its statistical properties like bias and variance. This gives a measure of goodness for the comparison of different estimating functions: the best estimating function is one that gives the smallest error between the true separating matrix B* and the estimate B̂.
A particularly relevant measure is (Fisher) efficiency or asymptotic variance, as the size T of the sample x(1), ..., x(T) grows large (see Chapter 4). The goal is to design an estimating function that gives the smallest variance, given the set of observations x(t). Then the optimal amount of information is extracted from the training set.

The general result provided by Amari and Cardoso [8] is that estimating functions of the form (12.14) are optimal in the sense that, given any estimating function F, one can always find a better or at least equally good estimating function (in the sense of efficiency) having the form

F(x, B) = f(y)yᵀ − Λ    (12.15)
        = f(Bx)(Bx)ᵀ − Λ    (12.16)

where Λ is a diagonal matrix. Actually, the diagonal matrix Λ has no effect on the off-diagonal elements of F(x, B), which are the ones determining the independence between y_i, y_j; the diagonal elements are simply scaling factors. The result shows that it is unnecessary to use a nonlinear function g(y) instead of y as the other one of the two functions in nonlinear decorrelation. Only one nonlinear function f(y), combined with y, is sufficient. It is interesting that functions of exactly the type f(y)yᵀ naturally emerge as gradients of cost functions such as likelihood; the question of how to choose the nonlinearity f(y) is also answered in that case. A further example is given in the following section.

The preceding analysis is not related in any way to the practical methods for finding the roots of estimating functions. Due to the nonlinearities, closed-form solutions do not exist, and numerical algorithms have to be used. The simplest iterative stochastic approximation algorithm for solving the roots of F(x, B) has the form

ΔB = μ F(x, B)    (12.17)

with μ an appropriate learning rate. In fact, we now discover that the learning rules (12.9), (12.10), and (12.11) are examples of this more general framework.
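As an illustration of (12.17) (again our sketch, with assumed choices of sources, nonlinearity, and step size), one can iterate on the sample version of the estimating function F(x, B) = [Λ − f(y)yᵀ]B in batch mode: replace the expectation by an average over the whole sample and take small steps until the average update vanishes, at which point B is an approximate root of the sample estimating function:

```python
import numpy as np

rng = np.random.default_rng(7)
n, T = 3, 10_000
s = rng.uniform(-np.sqrt(3), np.sqrt(3), (n, T))   # sub-Gaussian sources
A = np.array([[1.0, 0.4, 0.2],
              [0.3, 1.0, 0.4],
              [0.1, 0.3, 1.0]])
x = A @ s

B = np.eye(n)
mu = 0.1                        # batch step size (assumed)
Lam = np.eye(n)
f = lambda y: y**3              # one assumed choice of nonlinearity

for _ in range(500):
    y = B @ x
    F_hat = (Lam - (f(y) @ y.T) / T) @ B   # sample average of F(x, B)
    B += mu * F_hat                        # batch version of (12.17)

G = B @ A    # close to a scaled permutation at a root of the sample function
```

Because the whole sample is used in every step, the iteration is deterministic and settles at a root B̂ of the sample estimating function; B̂ deviates from a true separating matrix only by the finite-sample error discussed above.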
12.5 EQUIVARIANT ADAPTIVE SEPARATION VIA INDEPENDENCE

In most of the proposed approaches to ICA, the learning rules are gradient descent algorithms for cost (or contrast) functions. Many cases have been covered in previous chapters. Typically, the cost function has the form J(B) = E{G(y)}, with G some scalar function, and usually some additional constraints are used. Here again y = Bx, and the form of the function G and the probability density of x determine the shape of the contrast function J(B). It is easy to show (see the definition of matrix and vector gradients in Chapter 3) that

∂J(B)/∂B = E{(∂G(y)/∂y) xᵀ} = E{g(y) xᵀ}    (12.18)

where g(y) is the gradient of G(y). If B is square and invertible, then x = B^(−1) y and we have

∂J(B)/∂B = E{g(y) yᵀ} (Bᵀ)^(−1)    (12.19)

For appropriate nonlinearities G(y), these gradients are estimating functions in the sense that the elements of y must be statistically independent when the gradient becomes zero. Note also that in the form E{g(y)yᵀ}(Bᵀ)^(−1), the first factor g(y)yᵀ has the shape of an optimal estimating function (except for the diagonal elements); see Eq. (12.15). Now we also know how the nonlinear function g(y) can be determined: it is directly the gradient of the function G(y) appearing in the original cost function.

Unfortunately, the matrix inversion (Bᵀ)^(−1) in (12.19) is cumbersome. Matrix inversion can be avoided by using the so-called natural gradient introduced by Amari [4]. This is covered in Chapter 3. The natural gradient is obtained in this case by multiplying the usual matrix gradient (12.19) from the right by the matrix BᵀB, which gives E{g(y)yᵀ}B. The ensuing stochastic gradient algorithm to minimize the cost function J(B) is then

ΔB = −μ g(y)yᵀ B    (12.20)

This learning rule again has the form of nonlinear decorrelations.
Omitting the diagonal elements of g(y)yᵀ, the off-diagonal elements have the same form as in the Cichocki-Unbehauen algorithm (12.11), with the two functions now given by the linear function y and the gradient g(y).

This gradient algorithm can also be derived using the relative gradient introduced by Cardoso and Hvam Laheld [71]. This approach, too, is reviewed in Chapter 3. Based on it, the authors developed their equivariant adaptive separation via independence (EASI) learning algorithm. To proceed from (12.20) to the EASI learning rule, an extra step must be taken.

In EASI, as in many other learning rules for ICA, a whitening preprocessing is considered for the mixture vectors x (see Chapter 6). We first transform x linearly to z = Vx whose elements z_i have unit variances and zero covariances: E{zzᵀ} = I. As also shown in Chapter 6, an appropriate adaptation rule for whitening is

ΔV = μ (I − zzᵀ) V    (12.21)

The ICA model using these whitened vectors instead of the original ones becomes z = VAs, and it is easily seen that the matrix VA is an orthogonal matrix (a rotation). Thus its inverse, which gives the separating matrix, is also orthogonal. As in earlier chapters, let us denote the orthogonal separating matrix by W. Basically, the learning rule for W would be the same as (12.20). However, as noted in [71], certain constraints must hold in any update of W if orthogonality is to be preserved at each iteration step. Let us denote the serial update of W using the learning rule (12.20), briefly, as W ← W + DW, where now D = −μ g(y)yᵀ. The orthogonality condition for the updated matrix becomes

(W + DW)(W + DW)ᵀ = I + D + Dᵀ + DDᵀ = I

where WWᵀ = I has been substituted. Assuming D small, the first-order approximation gives the condition D = −Dᵀ, i.e., D must be skew-symmetric.
Applying this condition to the relative gradient learning rule (12.20) for W, we have

ΔW = −μ [g(y)yᵀ − y g(y)ᵀ] W    (12.22)

where now y = Wz. Contrary to the learning rule (12.20), this learning rule also takes care of the diagonal elements of g(y)yᵀ in a natural way, without imposing any conditions on them.

What is left now is to combine the two learning rules (12.21) and (12.22) into just one learning rule for the global separating matrix of the system. Because y = Wz = WVx, this global separating matrix is B = WV. Assuming the same learning rate μ for the two algorithms, a first-order approximation gives

ΔB = (ΔW)V + W(ΔV)
   = −μ [g(y)yᵀ − y g(y)ᵀ] WV + μ [WV − W zzᵀ Wᵀ WV]
   = −μ [yyᵀ − I + g(y)yᵀ − y g(y)ᵀ] B    (12.23)

This is the EASI algorithm. It has the nice feature of combining both whitening and separation into a single algorithm. A convergence analysis as well as some experimental results are given in [71]. One can easily see the close connection to the nonlinear decorrelation algorithms introduced earlier.

The concept of equivariance that forms part of the name of the EASI algorithm is a general concept in statistical estimation; see, e.g., [395]. Equivariance of an estimator means, roughly, that its performance does not depend on the actual value of the parameter. In the context of the basic ICA model, this means that the ICs can be estimated with the same performance whatever the mixing matrix may be. EASI was one of the first ICA algorithms that was explicitly shown to be equivariant. In fact, most estimators of the basic ICA model are equivariant. For a detailed discussion, see [69]. [...]
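A compact simulation of (12.23) may clarify how whitening and separation are combined (our sketch; the sub-Gaussian sources, the cubic nonlinearity g(y) = y^3, which is one reasonable choice for negative-kurtosis sources, and the learning rate are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
n, T = 3, 20_000
s = rng.uniform(-np.sqrt(3), np.sqrt(3), (n, T))   # sub-Gaussian sources
A = np.array([[1.0, 0.4, 0.2],
              [0.3, 1.0, 0.4],
              [0.1, 0.3, 1.0]])
x = A @ s

B = np.eye(n)
mu = 0.002                  # learning rate (assumed)
g = lambda y: y**3          # cubic nonlinearity (assumed choice)
I = np.eye(n)

for _ in range(5):          # several passes over the data
    for t in range(T):
        y = B @ x[:, t]
        # EASI rule (12.23): the symmetric part (yy^T - I) whitens the
        # outputs, the skew-symmetric part rotates toward independence.
        B -= mu * (np.outer(y, y) - I
                   + np.outer(g(y), y) - np.outer(y, g(y))) @ B

G = B @ A    # global matrix; should approach a (signed) permutation
```

Note that no separate whitening matrix V is maintained: the term yyᵀ − I alone drives E{yyᵀ} toward I, so the converged outputs are both white and (hopefully) independent.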
[...] in [236].

Fig. 12.6 The separated images using the nonlinear PCA criterion and learning rule. [Nine image panels, labeled −(W+NPCA1) through W+NPCA9.]

12.9 CONCLUDING REMARKS AND REFERENCES

The first part of this chapter reviewed some of the early research efforts in ICA, especially the technique based on nonlinear decorrelations. It was based [...]

[...] on the nonlinearities g_i(y). A more detailed analysis of the criterion (12.25) and its relation to ICA is given in the next section.

12.7 THE NONLINEAR PCA CRITERION AND ICA

Interestingly, for prewhitened data, it can be shown [236] that the original nonlinear PCA criterion of Eq. (12.25) has an exact relationship with other contrast functions [...]

Fig. 12.5 The whitened images. [Nine image panels, labeled W1 through −W9.]

[...] is the input vector, y(t) is the output vector, W(t) is the weight matrix, and g is the nonlinearity in the NLPCA criterion. The parameter [...] is a kind of "forgetting constant" that should be close to unity. The notation Tri means that only the upper triangular part of the matrix is computed and its [...]

[...] notation and approach [256]. He chooses only one nonlinearity g1(y) = ... = gn(y) = g(y):

g(y) = E{y²} p'(y)/p(y)    (12.35)

The function p(y) is the density of y and p'(y) its derivative. Lambert [256] also gives several algorithms for minimizing this cost function. Note that now, in the whitened data case, the variance of y is equal to one and [...]

[...] autoassociators [252, 325] that also give nonlinear PCA. In these methods, the approximating subspace is a curved manifold, while the solution to the problem posed earlier is still a linear subspace. Only the coefficients corresponding to the principal components are nonlinear functions of [...]. It should be noted that [...] minimizing the criterion (12.25) [...]

[...] sides by the orthogonal separating matrix Mᵀ, giving

Δ(WMᵀ) = μ [g(WMᵀMz) zᵀMᵀ − g(WMᵀMz) g(zᵀMᵀMWᵀ) WMᵀ]    (12.44)

where we have used the fact that MᵀM = I. Denoting for the moment H = WMᵀ and using (12.43), we have

ΔH = μ [g(Hs) sᵀ − g(Hs) g(sᵀHᵀ) H]    (12.45)

This equation has exactly the same form as the original one (12.41). Geometrically [...]

[...] formulation of the ICA problem. Due to this, recursive least mean-square algorithms can be derived; several versions, like the symmetric, sequential, and batch algorithms, are given in [236].

Problems

12.1 In the Hérault-Jutten algorithm (12.9), (12.10), let f(y1) = y1³ and g(y2) = y2. Write the update equations so that only x1, x2, m12, and m21 [...]

[...] here called "nonlinear principal component analysis" (NLPCA). It should be emphasized that practically always, when a well-defined linear problem is extended into a nonlinear one, many ambiguities and alternative definitions arise. This is the case here, too. The term nonlinear PCA is by no means unique. There are several other techniques, like the method of principal curves [167, 264] or the nonlinear autoassociators [...]

[...] error than standard PCA. Instead, the virtue of this criterion is that it introduces higher-order statistics in a simple manner via the nonlinearities g_i. Before going into any deeper analysis of (12.25), it may be instructive to see in a simple special case how it differs from linear PCA and how it is in fact related to ICA. If the functions g_i(y) were linear, as in the standard PCA technique, and the number [...]

[...] equivalences of the NLPCA criterion with these can be established. More details are given in [236] and Chapter 14.

12.8 LEARNING RULES FOR THE NONLINEAR PCA CRITERION

Once the nonlinearities g_i(y) have been chosen, it remains to actually solve the minimization problem in the nonlinear PCA criterion. Here we present the simplest learning algorithms for minimizing either the original NLPCA criterion (12.25) [...]