12

ICA by Nonlinear Decorrelation and Nonlinear PCA
This chapter starts by reviewing some of the early research efforts in independent
component analysis (ICA), especially the technique based on nonlinear decorrelation
that was successfully used by Jutten, Hérault, and Ans to solve the first ICA problems.
Today, this work is mainly of historical interest, because several more
efficient algorithms for ICA now exist.
Nonlinear decorrelation can be seen as an extension of second-order methods
such as whitening and principal component analysis (PCA). These methods give
components that are uncorrelated linear combinations of input variables, as explained
in Chapter 6. We will show that independent components can in some cases be found
as nonlinearly uncorrelated linear combinations. The nonlinear functions used in
this approach introduce higher order statistics into the solution method, making ICA
possible.
We then show how the work on nonlinear decorrelation eventually led to the
Cichocki-Unbehauen algorithm, which is essentially the same as the algorithm that
we derived in Chapter 9 using the natural gradient. Next, the criterion of nonlinear
decorrelation is extended and formalized into the theory of estimating functions, and
the closely related EASI algorithm is reviewed.
Another approach to ICA that is related to PCA is the so-called nonlinear PCA.
A nonlinear representation is sought for the input data that minimizes a least mean-
square error criterion. For the linear case, it was shown in Chapter 6 that principal
components are obtained. It turns out that in some cases the nonlinear PCA approach
gives independent components instead. We review the nonlinear PCA criterion and
show its equivalence to other criteria like maximum likelihood (ML). Then, two
typical learning rules introduced by the authors are reviewed, of which the first one
is a stochastic gradient algorithm and the other one a recursive least mean-square
algorithm.

Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja.
Copyright 2001 John Wiley & Sons, Inc.
ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)
12.1 NONLINEAR CORRELATIONS AND INDEPENDENCE

The correlation between two random variables $x$ and $y$ was discussed in detail in
Chapter 2. Here we consider zero-mean variables only, so correlation and covariance
are equal. Correlation is related to independence in such a way that independent
variables are always uncorrelated. The opposite is not true, however: the variables
can be uncorrelated, yet dependent. An example is a uniform density in a rotated
square centered at the origin of the $(x, y)$ space; see, e.g., Fig. 8.3. Both $x$ and $y$
are zero mean and uncorrelated, no matter what the orientation of the square, but
they are independent only if the square is aligned with the coordinate axes. In some
cases uncorrelatedness does imply independence, though; the best example is the
case when the density of $(x, y)$ is constrained to be jointly gaussian.
Extending the concept of correlation, we here define the nonlinear correlation of
the random variables $x$ and $y$ as $E\{f(x)g(y)\}$. Here, $f$ and $g$ are two
functions, of which at least one is nonlinear. Typical examples might be polynomials
of degree higher than 1, or more complex functions like the hyperbolic tangent. This
means that one or both of the random variables are first transformed nonlinearly to
new variables $f(x), g(y)$, and then the usual linear correlation between these new
variables is considered.

The question now is: Assuming that $x$ and $y$ are nonlinearly decorrelated in the
sense

$$E\{f(x)g(y)\} = 0 \qquad (12.1)$$

can we say something about their independence? We would hope that by making
this kind of nonlinear correlation zero, independence would be obtained under some
additional conditions to be specified.

There is a general theorem (see, e.g., [129]) stating that $x$ and $y$ are independent
if and only if

$$E\{f(x)g(y)\} = E\{f(x)\}E\{g(y)\} \qquad (12.2)$$

for all continuous functions $f$ and $g$ that are zero outside a finite interval. Based
on this, it seems very difficult to approach independence rigorously, because the
functions $f$ and $g$ are almost arbitrary. Some kind of approximations are needed.
This problem was considered by Jutten and Hérault [228]. Let us assume that $f$
and $g$ are smooth functions that have derivatives of all orders in a neighborhood
of the origin. They can be expanded in Taylor series:

$$f(x) = \sum_i f_i x^i, \qquad g(y) = \sum_j g_j y^j$$

where $f_i, g_j$ are shorthand for the coefficients of the $i$th and $j$th powers in the series.
The product of the functions is then

$$f(x)g(y) = \sum_i \sum_j f_i g_j x^i y^j \qquad (12.3)$$

and condition (12.1) is equivalent to

$$E\{f(x)g(y)\} = \sum_i \sum_j f_i g_j E\{x^i y^j\} = 0 \qquad (12.4)$$

Obviously, a sufficient condition for this equation to hold is

$$E\{x^i y^j\} = 0 \qquad (12.5)$$

for all indices $i, j$ appearing in the series expansion (12.4). There may be other
solutions in which the higher-order correlations are not zero, but the coefficients
happen to be just suitable to cancel the terms and make the sum in (12.4) exactly
equal to zero. For nonpolynomial functions that have infinite Taylor expansions, such
spurious solutions can be considered unlikely (we will see later that such spurious
solutions do exist but they can be avoided by the theory of ML estimation).

Again, a sufficient condition for (12.5) to hold is that the variables $x$ and $y$ are
independent and one of $E\{x^i\}$, $E\{y^j\}$ is zero. Let us require that $E\{x^i\} = 0$ for all
powers $x^i$ appearing in the series expansion of $f$. But this is only possible if $f$ is an odd
function; then the Taylor series contains only odd powers $x^i$, and the powers $i$
in Eq. (12.5) will also be odd. Otherwise, we would have the case that even moments of
$x$, like the variance, are zero, which is impossible unless $x$ is constant.

To conclude, a sufficient (but not necessary) condition for the nonlinear uncorrelatedness
(12.1) to hold is that $x$ and $y$ are independent, and for one of them, say
$x$, the nonlinearity $f$ is an odd function such that $f(x)$ has zero mean.
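This conclusion can be checked numerically. The following sketch (not from the original text; the rotation angle, sample size, and the choice $f(x) = x^3$, $g(y) = y$ are arbitrary illustrative choices) builds a pair of uncorrelated but dependent variables from a rotated uniform density, and shows that the nonlinear correlation exposes the dependence while the linear correlation does not:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Two independent, zero-mean uniform variables.
u = rng.uniform(-1, 1, n)
v = rng.uniform(-1, 1, n)

# Rotate by 30 degrees: x and y stay uncorrelated (a rotation of two
# equal-variance uncorrelated variables is still uncorrelated), but they
# are now dependent, as in the rotated-square example.
t = np.pi / 6
x = np.cos(t) * u + np.sin(t) * v
y = -np.sin(t) * u + np.cos(t) * v

linear_corr = np.mean(x * y)        # ordinary correlation: ~0
nonlinear_corr = np.mean(x**3 * y)  # nonlinear correlation, f(x) = x^3, g(y) = y
indep_corr = np.mean(u**3 * v)      # same statistic for truly independent variables

print(linear_corr, nonlinear_corr, indep_corr)
```

For this particular construction the expectation can even be computed in closed form, $E\{x^3 y\} = \sin(4\theta)/30 \approx 0.029$ at $\theta = 30°$: the sample estimate should be near this value, while both the linear correlation and the statistic for the independent pair are near zero.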
The preceding discussion is informal, but it should make it credible that nonlinear
correlations are useful as a possible general criterion for independence. Several things
have to be decided in practice. The first is how to actually choose the functions
$f$ and $g$. Is there some natural optimality criterion that can tell us that some functions
Fig. 12.1 The basic feedback circuit for the Hérault-Jutten algorithm: the mixtures
$x_1, x_2$ enter two summation elements, whose outputs $y_1, y_2$ are fed back through
the cross-weights $-m_{12}$ and $-m_{21}$. The element marked with $\Sigma$ is a summation.
are better than some other ones? This will be answered in Sections 12.3 and 12.4.
The second problem is how we could solve Eq. (12.1), or nonlinearly decorrelate two
variables $y_1, y_2$. This is the topic of the next section.
12.2 THE HÉRAULT-JUTTEN ALGORITHM
Consider the ICA model $\mathbf{x} = \mathbf{A}\mathbf{s}$. Let us first look at the two-dimensional case, which was
considered by Hérault, Jutten, and Ans [178, 179, 226] in connection with the blind
separation of two signals from two linear mixtures. The model is then

$$x_1 = a_{11}s_1 + a_{12}s_2$$
$$x_2 = a_{21}s_1 + a_{22}s_2$$

Hérault and Jutten proposed the feedback circuit shown in Fig. 12.1 to solve the problem.
The initial outputs are fed back to the system, and the outputs are recomputed
until an equilibrium is reached.

From Fig. 12.1 we have directly

$$y_1 = x_1 - m_{12}y_2 \qquad (12.6)$$
$$y_2 = x_2 - m_{21}y_1 \qquad (12.7)$$

Before inputting the mixture signals $x_1, x_2$ to the network, they were normalized to
zero mean, which means that the outputs $y_1, y_2$ also will have zero means. Defining a
matrix $\mathbf{M}$ with off-diagonal elements $m_{12}, m_{21}$ and diagonal elements equal to zero,
these equations can be compactly written as

$$\mathbf{y} = \mathbf{x} - \mathbf{M}\mathbf{y}$$

Thus the input-output mapping of the network is

$$\mathbf{y} = (\mathbf{I} + \mathbf{M})^{-1}\mathbf{x} \qquad (12.8)$$
Note that from the original ICA model we have $\mathbf{s} = \mathbf{A}^{-1}\mathbf{x}$, provided that $\mathbf{A}$ is
invertible. If $\mathbf{I} + \mathbf{M} = \mathbf{A}$, then $\mathbf{y}$ becomes equal to $\mathbf{s}$. However, the problem in blind
separation is that the matrix $\mathbf{A}$ is unknown.

The solution that Jutten and Hérault introduced was to adapt the two feedback
coefficients $m_{12}, m_{21}$ so that the outputs $y_1, y_2$ of the network become independent.
Then the matrix $\mathbf{A}$ has been implicitly inverted and the original sources have been
found. For independence, they used the criterion of nonlinear correlations. They
proposed the following learning rules:

$$\Delta m_{12} = \mu f(y_1)g(y_2) \qquad (12.9)$$
$$\Delta m_{21} = \mu f(y_2)g(y_1) \qquad (12.10)$$

with $\mu$ the learning rate. Both functions are odd functions; typically, the
functions $f(y) = y^3$, $g(y) = \arctan(y)$ were used, although the method also seems
to work for $g(y) = y$ or $g(y) = \mathrm{sign}(y)$.

Now, if the learning converges, then the right-hand sides must be zero on average,
implying

$$E\{f(y_1)g(y_2)\} = E\{f(y_2)g(y_1)\} = 0$$

Thus independence has hopefully been attained for the outputs $y_1, y_2$. A stability
analysis for the Hérault-Jutten algorithm was presented in [408].

In the numerical computation of the matrix $\mathbf{M}$ according to algorithm (12.9), (12.10),
the outputs $y_1, y_2$ on the right-hand side must also be updated at each step of the
iteration. By Eq. (12.8), they too depend on $\mathbf{M}$, and solving them requires the
inversion of the matrix $\mathbf{I} + \mathbf{M}$. As noted by Cichocki and Unbehauen [84], this matrix
inversion may be computationally heavy, especially if this approach is extended to
more than two sources and mixtures. One way to circumvent this problem is to make
a rough approximation of the inverse (e.g., the first-order expansion
$(\mathbf{I} + \mathbf{M})^{-1} \approx \mathbf{I} - \mathbf{M}$) that seems to work in practice.
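To make the algorithm concrete, here is a minimal simulation sketch (not from the book; the mixing matrix, learning-rate schedule, and sample sizes are arbitrary illustrative choices). It uses the classic nonlinearities $f(y) = y^3$, $g(y) = \arctan(y)$ and, for the two-source case, solves the feedback equilibrium exactly with the $2 \times 2$ inverse of Eq. (12.8):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Two independent zero-mean uniform sources, and a mixing matrix with
# unit diagonal, so that at the separating solution m12 = 0.3, m21 = 0.4.
s = rng.uniform(-1, 1, (2, n))
A = np.array([[1.0, 0.3],
              [0.4, 1.0]])
x = A @ s

m12, m21 = 0.0, 0.0
for epoch in range(3):
    mu = 0.05 / (epoch + 1)              # shrinking learning rate
    for t in range(n):
        # Equilibrium of the feedback circuit: y = (I + M)^{-1} x.
        d = 1.0 - m12 * m21
        y1 = (x[0, t] - m12 * x[1, t]) / d
        y2 = (x[1, t] - m21 * x[0, t]) / d
        # Learning rules with f(y) = y^3, g(y) = arctan(y).
        m12 += mu * y1**3 * np.arctan(y2)
        m21 += mu * y2**3 * np.arctan(y1)

print(m12, m21)   # should drift toward the off-diagonal mixing coefficients
```

When $\mathbf{I} + \mathbf{M}$ approaches $\mathbf{A}$, the network outputs approach the sources; a quick check is that the recovered outputs correlate almost perfectly with $s_1$ and $s_2$.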
Although the Hérault-Jutten algorithm was a very elegant pioneering solution to
the ICA problem, we know now that it has some drawbacks in practice. The algorithm
may work poorly or even fail to separate the sources altogether if the signals are badly
scaled or the mixing matrix is ill-conditioned. The number of sources that the method
can separate is severely limited. Also, although local stability was shown in [408],
good global convergence behavior is not guaranteed.
12.3 THE CICHOCKI-UNBEHAUEN ALGORITHM

Starting from the Hérault-Jutten algorithm, Cichocki, Unbehauen, and coworkers [82,
85, 84] derived an extension that has much enhanced performance and reliability.
Instead of a feedback circuit like the Hérault-Jutten network in Fig. 12.1, Cichocki
and Unbehauen proposed a feedforward network with weight matrix $\mathbf{W}$, with the
mixture vector $\mathbf{x}$ for input and with output $\mathbf{y} = \mathbf{W}\mathbf{x}$. Now the dimensionality of the
problem can be higher than 2. The goal is to adapt the matrix $\mathbf{W}$ so that the
elements of $\mathbf{y}$ become independent. The learning algorithm for $\mathbf{W}$ is as follows:

$$\Delta\mathbf{W} = \mu[\mathbf{\Lambda} - f(\mathbf{y})g(\mathbf{y})^T]\mathbf{W} \qquad (12.11)$$

where $\mu$ is the learning rate, $\mathbf{\Lambda}$ is a diagonal matrix whose elements determine the
amplitude scaling for the elements of $\mathbf{y}$ (typically, $\mathbf{\Lambda}$ could be chosen as the unit
matrix $\mathbf{I}$), and $f$ and $g$ are two nonlinear scalar functions; the authors proposed a
polynomial and a hyperbolic tangent. The notation $f(\mathbf{y})$ means a column vector with
elements $f(y_i)$.

The argumentation showing that this algorithm will give independent components,
too, is based on nonlinear decorrelations. Consider the stationary solution of this
learning rule, defined as the matrix $\mathbf{W}$ for which $E\{\Delta\mathbf{W}\} = 0$, with the expectation
taken over the density of the mixtures $\mathbf{x}$. For this matrix, the update is on average
zero. Because this is a stochastic approximation type algorithm (see Chapter 3), such
stationarity is a necessary condition for convergence. Excluding the trivial solution
$\mathbf{W} = 0$, we must have

$$E\{\mathbf{\Lambda} - f(\mathbf{y})g(\mathbf{y})^T\} = 0$$

Especially, for the off-diagonal elements, this implies

$$E\{f(y_i)g(y_j)\} = 0, \quad i \neq j \qquad (12.12)$$

which is exactly our definition of nonlinear decorrelation in Eq. (12.1), extended to $n$
output signals $y_1, \ldots, y_n$. The diagonal elements satisfy

$$E\{f(y_i)g(y_i)\} = \lambda_{ii}$$

showing that the diagonal elements $\lambda_{ii}$ of matrix $\mathbf{\Lambda}$ only control the amplitude scaling
of the outputs.

The conclusion is that if the learning rule converges to a nonzero matrix $\mathbf{W}$, then
the outputs of the network must become nonlinearly decorrelated, and hopefully
independent. The convergence analysis has been performed in [84]; for general
principles of analyzing stochastic iteration algorithms like (12.11), see Chapter 3.

The justification for the Cichocki-Unbehauen algorithm (12.11) in the original
articles was based on nonlinear decorrelations, not on any rigorous cost function
that would be minimized by the algorithm. However, it is interesting to note that
this algorithm, first appearing in the early 1990s, is in fact the same as the popular
natural gradient algorithm introduced later by Amari, Cichocki, and Yang [12] as
an extension of the original Bell-Sejnowski algorithm [36]. All we have to do is
choose $\mathbf{\Lambda}$ as the unit matrix, the function $g$ as the linear function $g(y) = y$,
and the function $f$ as a sigmoid related to the true density of the sources. The
Amari-Cichocki-Yang algorithm and the Bell-Sejnowski algorithm were reviewed
in Chapter 9, where it was shown how the algorithms are derived from the rigorous
maximum likelihood criterion. The maximum likelihood approach also tells us what
kind of nonlinearities should be used, as discussed in Chapter 9.
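As an illustration, the following sketch implements rule (12.11) in the special case just mentioned: $\mathbf{\Lambda} = \mathbf{I}$, $g(y) = y$, and $f = \tanh$, a suitable sigmoidal nonlinearity for super-gaussian (here Laplacian) sources. The mixing matrix, step size, and iteration count are arbitrary illustrative choices, and the update is averaged over the whole batch:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Three independent super-gaussian (Laplacian) sources with unit variance.
s = rng.laplace(0.0, 1.0 / np.sqrt(2.0), (3, n))
A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.1, 0.3, 1.0]])
x = A @ s

# Rule (12.11) with Lambda = I, f = tanh, g = identity, batch-averaged.
W = np.eye(3)
mu = 0.05
for _ in range(1000):
    y = W @ x
    W += mu * (np.eye(3) - np.tanh(y) @ y.T / n) @ W

# If separation succeeded, W A is close to a scaled permutation matrix.
P = W @ A
print(np.round(P, 2))
```

Each row of $\mathbf{W}\mathbf{A}$ should be dominated by a single element, meaning that each output recovers one source up to scaling; the scaling itself is fixed by the diagonal condition $E\{f(y_i)g(y_i)\} = \lambda_{ii} = 1$.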
12.4 THE ESTIMATING FUNCTIONS APPROACH *

Consider the criterion of nonlinear decorrelations being zero, generalized to $n$
random variables $y_1, \ldots, y_n$, as shown in Eq. (12.12). Among the possible roots of
these equations are the source signals $s_1, \ldots, s_n$. When solving these in an algorithm
like the Hérault-Jutten algorithm or the Cichocki-Unbehauen algorithm, one in fact
solves for the separating matrix $\mathbf{W}$.
This notion was generalized and formalized by Amari and Cardoso [8] to the case
of estimating functions. Again, consider the basic ICA model $\mathbf{x} = \mathbf{A}\mathbf{s}$, and let $\mathbf{B}_*$
denote a true separating matrix (we use this special notation here to avoid any
confusion). An estimating function $\mathbf{F}(\mathbf{x}, \mathbf{B})$ is a matrix-valued function such that

$$E\{\mathbf{F}(\mathbf{x}, \mathbf{B}_*)\} = 0 \qquad (12.13)$$

This means that, taking the expectation with respect to the density of $\mathbf{x}$, the true
separating matrices are roots of the equation. Once these are solved from Eq. (12.13),
the independent components are directly obtained.

Example 12.1 Given a set of nonlinear functions $g_i(y)$, with $i = 1, \ldots, n$,
and defining a vector function $\mathbf{g}(\mathbf{y}) = (g_1(y_1), \ldots, g_n(y_n))^T$, a suitable estimating
function for ICA is

$$\mathbf{F}(\mathbf{x}, \mathbf{B}) = \mathbf{\Lambda} - \mathbf{g}(\mathbf{y})\mathbf{y}^T, \qquad \mathbf{y} = \mathbf{B}\mathbf{x} \qquad (12.14)$$

because obviously $E\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}$ becomes diagonal when $\mathbf{B}$ is a true separating matrix
and $y_1, \ldots, y_n$ are independent and zero mean. Then the off-diagonal elements
become $E\{g_i(y_i)y_j\} = E\{g_i(y_i)\}E\{y_j\} = 0$. The diagonal matrix $\mathbf{\Lambda}$ determines the
scales of the separated sources. Another estimating function is the right-hand side of
the learning rule (12.11), $\mathbf{F}(\mathbf{x}, \mathbf{W}) = [\mathbf{\Lambda} - f(\mathbf{y})g(\mathbf{y})^T]\mathbf{W}$.
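A quick numerical check of Example 12.1 (a sketch with an arbitrarily chosen mixing matrix and the arbitrary choice $g = \tanh$): at a true separating matrix $\mathbf{B}_* = \mathbf{A}^{-1}$, the sample version of $E\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}$ is essentially diagonal, while at a non-separating matrix it is not:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

s = rng.laplace(0.0, 1.0 / np.sqrt(2.0), (2, n))  # independent unit-variance sources
A = np.array([[1.0, 0.6],
              [0.2, 1.0]])
x = A @ s

def sample_gyyT(B):
    """Sample average of g(y) y^T with g = tanh and y = B x."""
    y = B @ x
    return np.tanh(y) @ y.T / n

at_separating = sample_gyyT(np.linalg.inv(A))  # off-diagonals ~ E{g(s_i)} E{s_j} = 0
at_identity = sample_gyyT(np.eye(2))           # y = x is dependent: clearly nonzero

print(np.round(at_separating, 3))
print(np.round(at_identity, 3))
```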
There is a fundamental difference in the estimating function approach compared to
most of the other approaches to ICA: the usual starting point in ICA is a cost function
that somehow measures how independent or nongaussian the outputs are, and the
independent components are solved by minimizing the cost function. In contrast,
there is no such cost function here. The estimating function need not be the gradient
of any other function. In this sense, the theory of estimating functions is very general
and potentially useful for finding ICA algorithms. For a discussion of this approach
in connection with neural networks, see [328].

It is not a trivial question how to design in practice an estimating function so that
we can solve the ICA model. Even if we have two estimating functions that both
have been shaped in such a way that separating matrices are their roots, what is a
relevant measure to compare them? Statistical considerations are helpful here. Note
that in practice, the densities of the sources $\mathbf{s}$ and the mixtures $\mathbf{x}$ are unknown in
the ICA model. It is impossible in practice to solve Eq. (12.13) as such, because the
expectation cannot be formed. Instead, it has to be estimated using a finite sample of
$\mathbf{x}$. Denoting this sample by $\mathbf{x}(1), \ldots, \mathbf{x}(T)$, we use the sample function

$$\hat{E}\{\mathbf{F}(\mathbf{x}, \mathbf{B})\} = \frac{1}{T}\sum_{t=1}^{T}\mathbf{F}(\mathbf{x}(t), \mathbf{B})$$

Its root $\hat{\mathbf{B}}$ is then an estimator for the true separating matrix. Obviously (see Chapter
4), the root is a function of the training sample, and it is
meaningful to consider its statistical properties like bias and variance. This gives a
measure of goodness for the comparison of different estimating functions. The best
estimating function is one that gives the smallest error between the true separating
matrix $\mathbf{B}_*$ and the estimate $\hat{\mathbf{B}}$.

A particularly relevant measure is (Fisher) efficiency or asymptotic variance, as
the size $T$ of the sample grows large (see Chapter 4). The goal is
to design an estimating function that gives the smallest variance, given the set of
observations $\mathbf{x}(1), \ldots, \mathbf{x}(T)$. Then the optimal amount of information is extracted from the
training set.
The general result provided by Amari and Cardoso [8] is that estimating functions
of the form (12.14) are optimal in the sense that, given any estimating function $\mathbf{F}$,
one can always find a better or at least equally good estimating function (in the sense
of efficiency) having the form

$$\mathbf{F}^*(\mathbf{x}, \mathbf{B}) = \mathbf{\Lambda} - \mathbf{g}(\mathbf{y})\mathbf{y}^T \qquad (12.15)$$
$$\mathbf{y} = \mathbf{B}\mathbf{x} \qquad (12.16)$$

where $\mathbf{\Lambda}$ is a diagonal matrix. Actually, the diagonal matrix has no effect on the
off-diagonal elements of $\mathbf{g}(\mathbf{y})\mathbf{y}^T$, which are the ones determining the independence
between $y_1, \ldots, y_n$; the diagonal elements are simply scaling factors.

The result shows that it is unnecessary to use a nonlinear function instead of $\mathbf{y}$
as the other one of the two functions in nonlinear decorrelation. Only one nonlinear
function $\mathbf{g}$, combined with the linear function $\mathbf{y}$, is sufficient. It is interesting that functions of exactly
the type $\mathbf{g}(\mathbf{y})\mathbf{y}^T$ naturally emerge as gradients of cost functions such as the likelihood;
the question of how to choose the nonlinearity $\mathbf{g}$ is also answered in that case. A
further example is given in the following section.
The preceding analysis is not related in any way to the practical methods for finding
the roots of estimating functions. Due to the nonlinearities, closed-form solutions do
not exist and numerical algorithms have to be used. The simplest iterative stochastic
approximation algorithm for solving the roots of $\mathbf{F}(\mathbf{x}, \mathbf{B})$ has the form

$$\Delta\mathbf{B} = \mu\,\mathbf{F}(\mathbf{x}, \mathbf{B}) \qquad (12.17)$$

with $\mu$ an appropriate learning rate. In fact, we now discover that the learning rules
(12.9), (12.10), and (12.11) are examples of this more general framework.
12.5 EQUIVARIANT ADAPTIVE SEPARATION VIA INDEPENDENCE

In most of the proposed approaches to ICA, the learning rules are gradient descent
algorithms for cost (or contrast) functions. Many cases have been covered in previous
chapters. Typically, the cost function has the form $\mathcal{J}(\mathbf{W}) = \sum_i E\{G(y_i)\}$, with $G$
some scalar function, and usually some additional constraints are used. Here again
$\mathbf{y} = \mathbf{W}\mathbf{x}$, and the form of the function $G$ and the probability density of $\mathbf{x}$ determine
the shape of the contrast function $\mathcal{J}$.

It is easy to show (see the definition of matrix and vector gradients in Chapter 3)
that

$$\frac{\partial \mathcal{J}}{\partial \mathbf{W}} = E\{\mathbf{g}(\mathbf{W}\mathbf{x})\mathbf{x}^T\} = E\{\mathbf{g}(\mathbf{y})\mathbf{x}^T\} \qquad (12.18)$$

where $\mathbf{g}$ is the gradient of $G$. If $\mathbf{W}$ is square and invertible, then $\mathbf{x}^T = \mathbf{y}^T(\mathbf{W}^{-1})^T$
and we have

$$\frac{\partial \mathcal{J}}{\partial \mathbf{W}} = E\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}(\mathbf{W}^{-1})^T \qquad (12.19)$$

For appropriate nonlinearities $\mathbf{g}$, these gradients are estimating functions in
the sense that the elements of $\mathbf{y}$ must be statistically independent when the gradient
becomes zero. Note also that in the form $E\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}(\mathbf{W}^{-1})^T$, the first factor
$E\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}$ has the shape of an optimal estimating function (except for the diagonal
elements); see Eq. (12.15). Now we also know how the nonlinear function $\mathbf{g}$
can be determined: it is directly the gradient of the function $G$ appearing in the
original cost function.
Unfortunately, the matrix inversion $(\mathbf{W}^{-1})^T$ in (12.19) is cumbersome. Matrix
inversion can be avoided by using the so-called natural gradient introduced by Amari
[4]. This is covered in Chapter 3. The natural gradient is obtained in this case by
multiplying the usual matrix gradient (12.19) from the right by the matrix $\mathbf{W}^T\mathbf{W}$, which
gives $E\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}\mathbf{W}$. The ensuing stochastic gradient algorithm to minimize the cost
function is then

$$\Delta\mathbf{W} = -\mu\,\mathbf{g}(\mathbf{y})\mathbf{y}^T\mathbf{W} \qquad (12.20)$$

This learning rule again has the form of nonlinear decorrelations. Omitting the
diagonal elements of the matrix $\mathbf{g}(\mathbf{y})\mathbf{y}^T$, the off-diagonal elements have the same
form as in the Cichocki-Unbehauen algorithm (12.11), with the two functions now
given by the linear function $g(y) = y$ and the gradient $\mathbf{g}$.

This gradient algorithm can also be derived using the relative gradient introduced
by Cardoso and Hvam Laheld [71]. This approach is also reviewed in Chapter
3. Based on this, the authors developed their equivariant adaptive separation via
independence (EASI) learning algorithm. To proceed from (12.20) to the EASI
learning rule, an extra step must be taken. In EASI, as in many other learning
rules for ICA, a whitening preprocessing is considered for the mixture vectors $\mathbf{x}$
(see Chapter 6). We first transform $\mathbf{x}$ linearly to $\mathbf{z} = \mathbf{V}\mathbf{x}$, whose elements have
unit variances and zero covariances: $E\{\mathbf{z}\mathbf{z}^T\} = \mathbf{I}$. As also shown in Chapter 6, an
appropriate adaptation rule for whitening is

$$\Delta\mathbf{V} = \mu[\mathbf{I} - \mathbf{z}\mathbf{z}^T]\mathbf{V} \qquad (12.21)$$

The ICA model using these whitened vectors instead of the original ones becomes
$\mathbf{z} = \mathbf{V}\mathbf{A}\mathbf{s}$, and it is easily seen that the matrix $\mathbf{V}\mathbf{A}$ is an orthogonal matrix (a rotation).
Thus its inverse, which gives the separating matrix, is also orthogonal. As in earlier
chapters, let us denote the orthogonal separating matrix by $\mathbf{W}$.

Basically, the learning rule for $\mathbf{W}$ would be the same as (12.20). However, as
noted by [71], certain constraints must hold in any updating of $\mathbf{W}$ if the orthogonality
is to be preserved at each iteration step. Let us denote the serial update for $\mathbf{W}$ using
the learning rule (12.20), briefly, as $\mathbf{W} \leftarrow \mathbf{W} + \mu\mathbf{H}\mathbf{W}$, where now $\mathbf{H} = -\mathbf{g}(\mathbf{y})\mathbf{y}^T$ and $\mathbf{y} = \mathbf{W}\mathbf{z}$.
The orthogonality condition for the updated matrix becomes

$$(\mathbf{W} + \mu\mathbf{H}\mathbf{W})(\mathbf{W} + \mu\mathbf{H}\mathbf{W})^T = \mathbf{I}$$

where $\mathbf{W}\mathbf{W}^T = \mathbf{I}$ has been substituted. Assuming $\mu$ small, the first-order approximation
gives the condition that $\mathbf{H} + \mathbf{H}^T = 0$, or $\mathbf{H}$ must be skew-symmetric. Applying
this condition to the relative gradient learning rule (12.20) for $\mathbf{W}$, we have

$$\Delta\mathbf{W} = -\mu[\mathbf{g}(\mathbf{y})\mathbf{y}^T - \mathbf{y}\mathbf{g}(\mathbf{y})^T]\mathbf{W} \qquad (12.22)$$

where now $\mathbf{y} = \mathbf{W}\mathbf{z}$. Contrary to the learning rule (12.20), this learning rule also
takes care of the diagonal elements of $\mathbf{g}(\mathbf{y})\mathbf{y}^T$ in a natural way, without imposing
any conditions on them.

What is left now is to combine the two learning rules (12.21) and (12.22) into
just one learning rule for the global system separation matrix. Because $\mathbf{y} = \mathbf{W}\mathbf{z} = \mathbf{W}\mathbf{V}\mathbf{x}$,
this global separation matrix is $\mathbf{B} = \mathbf{W}\mathbf{V}$. Assuming the same learning rates
for the two algorithms, a first-order approximation gives

$$\Delta\mathbf{B} = -\mu[\mathbf{y}\mathbf{y}^T - \mathbf{I} + \mathbf{g}(\mathbf{y})\mathbf{y}^T - \mathbf{y}\mathbf{g}(\mathbf{y})^T]\mathbf{B} \qquad (12.23)$$

This is the EASI algorithm. It has the nice feature of combining both whitening
and separation into a single algorithm. A convergence analysis as well as some
experimental results are given in [71]. One can easily see the close connection to the
nonlinear decorrelation algorithms introduced earlier.
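The following sketch runs a one-sample EASI update of the form (12.23) on two artificially mixed Laplacian sources (the mixing matrix, the nonlinearity $g = \tanh$, the learning rate, and the sample size are arbitrary illustrative choices; for super-gaussian sources this choice of $g$ gives a stable separating solution):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

# Two independent super-gaussian (Laplacian) unit-variance sources.
s = rng.laplace(0.0, 1.0 / np.sqrt(2.0), (2, n))
A = np.array([[0.9, 0.5],
              [0.3, 1.1]])
x = A @ s

# EASI: B <- B - mu [y y^T - I + g(y) y^T - y g(y)^T] B, one sample at a time.
B = np.eye(2)
I2 = np.eye(2)
mu = 0.002
for t in range(n):
    y = B @ x[:, [t]]          # keep y as a column vector
    g = np.tanh(y)
    B -= mu * (y @ y.T - I2 + g @ y.T - y @ g.T) @ B

# The global system matrix B A should be close to a (scaled, signed) permutation.
P = B @ A
print(np.round(P, 2))
```

Note the design choice embodied in the update: the symmetric part $\mathbf{y}\mathbf{y}^T - \mathbf{I}$ enforces whitening of the outputs, while the skew-symmetric part $\mathbf{g}(\mathbf{y})\mathbf{y}^T - \mathbf{y}\mathbf{g}(\mathbf{y})^T$ rotates them toward independence, so no separate whitening stage is needed.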
The concept of equivariance that forms part of the name of the EASI algorithm
is a general concept in statistical estimation; see, e.g., [395]. Equivariance of an
estimator means, roughly, that its performance does not depend on the actual value of
the parameter. In the context of the basic ICA model, this means that the ICs can be
estimated with the same performance whatever the mixing matrix may be. EASI was
one of the first ICA algorithms that was explicitly shown to be equivariant. In fact,
most estimators of the basic ICA model are equivariant. For a detailed discussion,
see [69].
12.6 NONLINEAR PRINCIPAL COMPONENTS

One of the basic definitions of PCA was optimal least mean-square error compression
of the input data, as explained in more detail in Chapter 6. Assuming a random
$m$-dimensional zero-mean vector $\mathbf{x}$, a solution (although not the unique one) of this
optimization problem is given by the eigenvectors $\mathbf{e}_1, \ldots, \mathbf{e}_n$ of the data covariance
matrix $\mathbf{C}_x = E\{\mathbf{x}\mathbf{x}^T\}$, and the linear factors $\mathbf{w}_i^T\mathbf{x}$ in the compression sum become
the principal components. For instance, if $\mathbf{x}$ is two-dimensional with a gaussian
density, and we seek a one-dimensional subspace (a straight line passing through the
center of the density), then the solution is [...]

In the nonlinear generalization, the compression criterion takes the form

$$\mathcal{J}(\mathbf{w}_1, \ldots, \mathbf{w}_n) = E\Big\{\big\|\mathbf{x} - \sum_{i=1}^{n} g_i(\mathbf{w}_i^T\mathbf{x})\,\mathbf{w}_i\big\|^2\Big\} \qquad (12.25)$$

so that the principal components are nonlinear functions of $\mathbf{x}$. In the optimal solution
that minimizes the criterion $\mathcal{J}(\mathbf{w}_1, \ldots, \mathbf{w}_n)$, such factors $g_i(\mathbf{w}_i^T\mathbf{x})$ might be termed
nonlinear principal components; therefore, the technique of finding the basis vectors
$\mathbf{w}_i$ is here called "nonlinear principal component analysis" (NLPCA). It should be
emphasized that practically always, when a well-defined linear problem is extended
into a nonlinear one, many ambiguities appear [...] It should also be noted that
minimizing the criterion (12.25) does not give a smaller least mean-square error than
standard PCA. Instead, the virtue of this criterion is that it introduces higher-order
statistics in a simple manner via the nonlinearities $g_i$.

For linear functions $g_i$, the optimal basis vectors are orthonormal. For nonlinear
functions $g_i(y)$, however, this is usually not true. Instead, in some cases at least, it
turns out that the optimal basis vectors $\mathbf{w}_i$ minimizing (12.25) will be aligned with
the independent components of the input vectors.

Example 12.2 Assume that $\mathbf{x}$ is a two-dimensional random vector that has a uniform
density in a square that is not aligned with the coordinate axes $x_1, x_2$ [...] The
covariance matrix of $\mathbf{x}$ is therefore equal to $\frac{1}{3}\mathbf{I}$. Thus, except for the scaling by
$\frac{1}{3}$, the vector $\mathbf{x}$ is whitened (sphered). However, the elements are not independent.
The problem is to find a rotation $\mathbf{s} = \mathbf{W}\mathbf{x}$ of $\mathbf{x}$ such that the elements of the rotated
vector $\mathbf{s}$ are statistically independent. It is obvious from Fig. 12.2 that the elements
of $\mathbf{s}$ must be aligned with the orientation of the square, because then and only then
are they independent; for the orthonormal basis vectors it must hold that
$\mathbf{w}_1^T\mathbf{w}_2 = 0$. The solution minimizing the criterion (12.25), with $\mathbf{w}_1, \mathbf{w}_2$ orthogonal
two-dimensional vectors and $g_1(\cdot) = g_2(\cdot) = g(\cdot)$ a suitable nonlinearity, now
provides a rotation into independent components. This can be seen as follows.
Assume that $g$ is a very sharp sigmoid, e.g., $g(y) = \tanh(10y)$, which is approximately
the sign function. The term $\sum_{i=1}^{2} g(\mathbf{w}_i^T\mathbf{x})\,\mathbf{w}_i$ in the criterion (12.25) then
becomes [...]

For other densities, the same effect of rotation into independent directions would
not be achieved. Certainly, this would not take place for gaussian densities with equal
variances, for which the criterion $\mathcal{J}(\mathbf{w}_1, \ldots, \mathbf{w}_n)$ would be independent of the
orientation. Whether the criterion results in independent components depends
strongly on the nonlinearities $g_i(y)$. A more detailed analysis of the criterion
(12.25) [...]

The goal in analyzing the learning rule (12.41), introduced later in the chapter, is
to show that, starting from some initial value, the weight matrix will tend to a
separating matrix. Assume that the ICA model holds, i.e., there exists an orthogonal
separating matrix $\mathbf{M}$ such that $\mathbf{s} = \mathbf{M}\mathbf{z}$, where the elements of $\mathbf{s}$ are statistically
independent. With whitening, the dimension of $\mathbf{z}$ has been reduced to that of $\mathbf{s}$;
thus both $\mathbf{M}$ and $\mathbf{W}$ are $n \times n$ matrices. To make the further analysis easier, we
proceed by making a linear transformation of the learning rule (12.41): we multiply
both sides by the orthogonal separating matrix [...] For the transformed rule, this
translates into the requirement that the weight matrix should tend to the unit matrix
or a permutation matrix; then the output would tend to the source vector $\mathbf{s}$, or a
permuted version of it, with independent components [...]
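Example 12.2 can be checked numerically. The sketch below is an illustration, not the book's own derivation: the sample size, the rotation angle, and the angle grid are arbitrary choices, and the square's coordinates are taken uniform on $[-1, 1]$ so that the covariance is $\frac{1}{3}\mathbf{I}$ as stated in the example. It evaluates the sample version of criterion (12.25) with $g(y) = \tanh(10y)$ over all rotations of an orthonormal basis and locates its minimum:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40_000

# Uniform density on a square rotated by alpha = 25 degrees.
alpha = np.deg2rad(25.0)
u = rng.uniform(-1.0, 1.0, (2, n))
R = np.array([[np.cos(alpha), -np.sin(alpha)],
              [np.sin(alpha),  np.cos(alpha)]])
x = R @ u

def J(theta):
    """Sample criterion (12.25) for an orthonormal basis rotated by theta."""
    W = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])   # columns w_1, w_2
    y = W.T @ x                                       # components w_i^T x
    resid = x - W @ np.tanh(10.0 * y)                 # x - sum_i g(w_i^T x) w_i
    return np.mean(np.sum(resid**2, axis=0))

thetas = np.deg2rad(np.arange(0.0, 90.0, 1.0))  # the criterion is 90-degree periodic
best = np.rad2deg(thetas[np.argmin([J(t) for t in thetas])])
print(best)
```

The minimizing angle should fall close to the square's orientation of 25 degrees, i.e., the optimal basis vectors align with the independent components, and in particular the criterion is lower at the aligned rotation than at the 45-degree-off rotation.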
the orientation. Whether the criterion results in independent components, depends
strongly on the nonlinearities . A more detailed analysis. second-order methods
such as whitening and principal component analysis (PCA). These methods give
components that are uncorrelated linear combinations