
16 ICA with Overcomplete Bases

Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja. Copyright © 2001 John Wiley & Sons, Inc. ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)

A difficult problem in independent component analysis (ICA) is encountered if the number of mixtures x_i is smaller than the number of independent components s_i. This means that the mixing system is not invertible: we cannot obtain the independent components (ICs) by simply inverting the mixing matrix A. Therefore, even if we knew the mixing matrix exactly, we could not recover the exact values of the independent components, because information is lost in the mixing process.

This situation is often called ICA with overcomplete bases, because in the ICA model

x = As = \sum_i a_i s_i    (16.1)

the number of "basis vectors" a_i is larger than the dimension of the space of x: the basis is "too large", or overcomplete. Such a situation sometimes occurs in feature extraction of images, for example.

As with noisy ICA, we actually have two different problems: first, how to estimate the mixing matrix, and second, how to estimate the realizations of the independent components. This is in stark contrast to ordinary ICA, where these two problems are solved at the same time. The problem is similar to noisy ICA in another respect as well: it is much more difficult than the basic ICA problem, and the estimation methods are less developed.

16.1 ESTIMATION OF THE INDEPENDENT COMPONENTS

16.1.1 Maximum likelihood estimation

Many methods for estimating the mixing matrix use as subroutines methods that estimate the independent components for a known mixing matrix. Therefore, we shall first treat methods for reconstructing the independent components, assuming that we know the mixing matrix.

Let us denote by m the number of mixtures and by n the number of independent components. Thus, the mixing matrix has size m × n with n > m, and therefore it is not invertible. The simplest method of estimating the independent components would be to use the pseudoinverse of the mixing matrix. This yields

\hat{s} = A^T (A A^T)^{-1} x    (16.2)

In some situations, such a simple pseudoinverse gives a satisfactory solution, but in many cases we need a more sophisticated estimate.
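To make the pseudoinverse reconstruction (16.2) concrete, here is a minimal NumPy sketch; the dimensions, the random mixing matrix, and the Laplacian sources are illustrative assumptions, not values taken from the text.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 2, 3                          # fewer mixtures than components (n > m)
    A = rng.normal(size=(m, n))          # assumed known mixing matrix
    s_true = rng.laplace(size=n)         # hypothetical supergaussian sources
    x = A @ s_true                       # observed mixtures, x = A s

    # Pseudoinverse reconstruction, Eq. (16.2): s_hat = A^T (A A^T)^{-1} x
    s_hat = A.T @ np.linalg.solve(A @ A.T, x)
    # (equivalently, np.linalg.pinv(A) @ x)
    print(s_hat)

Note that s_hat generally differs from s_true: with n > m, exact recovery is impossible, and the pseudoinverse only gives the minimum-norm solution of x = As.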
A more sophisticated estimator of s can be obtained by maximum likelihood (ML) estimation [337, 275, 195], in a manner similar to the derivation of the ML or maximum a posteriori (MAP) estimator of the noise-free independent components in Chapter 15. We can write the posterior probability of s as follows:

p(s \mid x, A) = 1_{x = As} \prod_i p_i(s_i)    (16.3)

where 1_{x=As} is an indicator function that equals 1 if x = As and 0 otherwise, and the (prior) probability densities of the independent components are given by p_i(s_i). Thus, we obtain the maximum likelihood estimator of s as

\hat{s} = \arg\max_{x = As} \sum_i \log p_i(s_i)    (16.4)

Alternatively, we could assume that there is noise present as well. In this case, we get a likelihood that is formally the same as with ordinary noisy mixtures in (15.16); the only difference is the number of components in the formula.

The problem with the maximum likelihood estimator is that it is not easy to compute: the optimum cannot be expressed as a simple function in analytic form in any interesting case. It can be obtained in closed form if the s_i have a gaussian distribution, in which case the optimum is given by the pseudoinverse in (16.2). However, since ICA with gaussian variables is of little interest, the pseudoinverse is not a very satisfactory solution in many cases. In general, therefore, the estimator given by (16.4) can only be obtained by numerical optimization; a gradient ascent method can be easily derived.

One case where the optimization is easier than usual is when the s_i have a Laplacian distribution:

p_i(s_i) = \frac{1}{\sqrt{2}} \exp(-\sqrt{2}\,|s_i|)    (16.5)

Ignoring uninteresting constants, we then have

\hat{s} = \arg\min_{x = As} \sum_i |s_i|    (16.6)

which can be formulated as a linear program and solved by classic methods for linear programming [275].

16.1.2 The case of supergaussian components

Using a supergaussian distribution, such as the Laplacian distribution, is well justified in feature extraction, where the components are supergaussian. Using the Laplacian density also leads to an interesting phenomenon: the ML estimator gives coefficients \hat{s}_i of which only m are nonzero. Thus, only the minimum number of components are activated, and we obtain a sparse decomposition in the sense that the components are quite often equal to zero.

It may seem at first glance that it is useless to try to estimate the ICs by ML estimation, because they cannot be estimated exactly in any case. This is not so, however; due to this phenomenon of sparsity, ML estimation is very useful. In the case where the independent components are very supergaussian, most of them are very close to zero because of the large peak of the pdf at zero. (This is related to the principle of sparse coding, which will be treated in more detail in Section 21.2.) Thus, the components that are not zero may not be very many, and the system may be invertible for those components. If we first determine which components are likely to be clearly nonzero, and then invert that part of the linear system, we may be able to get quite accurate reconstructions of the ICs. This is done implicitly in the ML estimation method.

For example, assume that there are three speech signals mixed into two mixtures. Since speech signals are practically zero most of the time (which is reflected in their strong supergaussianity), we could assume that only two of the signals are nonzero at any given time, and successfully reconstruct those two signals [272].
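As an illustration of how (16.6) reduces to a linear program, the following sketch splits s into nonnegative parts and calls scipy.optimize.linprog; the helper name and the toy data are assumptions made for this example.

    import numpy as np
    from scipy.optimize import linprog

    def laplacian_ml_estimate(A, x):
        """Eq. (16.6): minimize sum_i |s_i| subject to A s = x.

        Standard LP reformulation: s = u - v with u, v >= 0,
        minimize sum(u + v) subject to A u - A v = x.
        """
        m, n = A.shape
        c = np.ones(2 * n)                    # objective: sum of u and v
        A_eq = np.hstack([A, -A])             # equality constraints A(u - v) = x
        res = linprog(c, A_eq=A_eq, b_eq=x, bounds=(0, None))
        u, v = res.x[:n], res.x[n:]
        return u - v

    # Hypothetical example: three sparse sources, two mixtures
    rng = np.random.default_rng(1)
    A = rng.normal(size=(2, 3))
    s = np.array([0.0, 1.5, -0.7])            # only two sources active
    print(laplacian_ml_estimate(A, A @ s))    # typically recovers the sparse s

In line with the sparsity discussion above, the linear-programming solution typically has at most m nonzero coefficients.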
16.2 ESTIMATION OF THE MIXING MATRIX

16.2.1 Maximizing joint likelihood

To estimate the mixing matrix, one can use maximum likelihood estimation. In the simplest case of ML estimation, we formulate the joint likelihood of A and the realizations of the s_i, and maximize it with respect to all these variables. It is slightly simpler to use a noisy version of the joint likelihood. This is of the same form as the one in Eq. (15.16):

\log L(A, s(1), \ldots, s(T)) = \sum_{t=1}^{T} \left[ -\frac{1}{2\sigma^2} \| A s(t) - x(t) \|^2 + \sum_{i=1}^{n} f_i(s_i(t)) \right] + C    (16.7)

where \sigma^2 is the noise variance, here assumed to be infinitely small, the s(t) are the realizations of the independent components, and C is an irrelevant constant. The functions f_i are the log-densities of the independent components.

Maximization of (16.7) with respect to A and the s_i could be accomplished by a global gradient ascent with respect to all the variables [337]. Another approach to maximization of the likelihood is to use an alternating variables technique [195], in which we first compute the ML estimates of the s_i(t) for a fixed A, and then, using these new s_i(t), we compute the ML estimate of A, and so on.

The ML estimate of the s_i(t) for a given A is given by the methods of the preceding section, considering the noise to be infinitely small. The ML estimate of A for given s_i(t) can be computed as

A = \left( \sum_t x(t) x(t)^T \right)^{-1} \sum_t x(t) s(t)^T    (16.8)

This algorithm needs some extra stabilization, however. For example, normalizing the estimates of the s_i to unit norm is necessary. Further stabilization can be obtained by first whitening the data. Then we have (considering infinitely small noise)

E\{ x x^T \} = A A^T = I    (16.9)

which means that the rows of A form an orthonormal system. This orthonormality could be enforced after every step of (16.8), for further stabilization.
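A rough sketch of this alternating procedure is given below. It reuses the hypothetical laplacian_ml_estimate helper from the previous sketch for the source step, and the whitening assumption, the fixed iteration count, and the SVD-based row orthonormalization are choices made for illustration rather than the authors' exact implementation.

    import numpy as np

    def alternating_ml(X, n_components, n_iter=50, seed=0):
        """Alternating ML sketch (Sec. 16.2.1): X is whitened data of shape (m, T)."""
        rng = np.random.default_rng(seed)
        m, T = X.shape
        A = rng.normal(size=(m, n_components))
        A /= np.linalg.norm(A, axis=0)
        for _ in range(n_iter):
            # 1. ML estimate of the sources for fixed A (noise assumed negligible);
            #    assumes laplacian_ml_estimate from the previous sketch is in scope.
            S = np.column_stack([laplacian_ml_estimate(A, X[:, t]) for t in range(T)])
            # Stabilization: normalize each estimated source signal to unit norm
            S /= np.linalg.norm(S, axis=1, keepdims=True) + 1e-12
            # 2. Update A as in Eq. (16.8)
            A = np.linalg.solve(X @ X.T, X @ S.T)
            # 3. Enforce orthonormal rows, A A^T = I, cf. Eq. (16.9)
            U, _, Vt = np.linalg.svd(A, full_matrices=False)
            A = U @ Vt
        return A, S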
16.2.2 Maximizing likelihood approximations

Maximization of the joint likelihood is a rather crude method of estimation. From a Bayesian viewpoint, what we really want to maximize is the marginal posterior probability of the mixing matrix. (For basic concepts of Bayesian estimation, see Section 4.6.) A more sophisticated form of maximum likelihood estimation can be obtained by using a Laplace approximation of the posterior distribution of A. This improves the stability of the algorithm, and has been successfully used for estimation of overcomplete bases from image data [274], as well as for separation of audio signals [272]. For details on the Laplace approximation, see [275]. An alternative to the Laplace approximation is provided by ensemble learning; see Section 17.5.1.

A promising direction of research is given by Monte Carlo methods. These are a class of methods often used in Bayesian estimation, and are based on numerical integration using stochastic algorithms. One method in this class, Gibbs sampling, has been used in [338] for overcomplete basis estimation. Monte Carlo methods typically give estimators with good statistical properties; the drawback is that they are computationally very demanding.

Also, one could use an expectation-maximization (EM) algorithm [310, 19]. Using gaussian mixtures as models for the distributions of the independent components, the algorithm can be derived in analytical form. The problem is, however, that its complexity grows exponentially with the dimension of s, and thus it can only be used in small dimensions. Suitable approximations of the algorithm might alleviate this limitation [19].

A very different approximation of the likelihood method was derived in [195], in which a form of competitive neural learning was used to estimate overcomplete bases with supergaussian data. This is a computationally powerful approximation that seems to work for certain data sets. The idea is that the extreme case of sparsity or supergaussianity is encountered when at most one of the ICs is nonzero at any one time. Thus we could simply assume that only one of the components is nonzero for a given data point, for example, the one with the highest value in the pseudoinverse reconstruction. This is not a realistic assumption in itself, but it may give an interesting approximation of the real situation in some cases.

16.2.3 Approximate estimation using quasiorthogonality

The maximum likelihood methods discussed in the preceding sections give a well-justified approach to ICA estimation with overcomplete bases. The problem with most of them is that they are computationally quite expensive. A typical application of ICA with overcomplete bases is, however, feature extraction, where we usually have spaces of very high dimensions. Therefore, we show here a method [203] that is more heuristically justified, but has the advantage of being no more expensive computationally than methods for basic ICA estimation. This method is based on the FastICA algorithm, combined with the concept of quasiorthogonality.

Sparse approximately uncorrelated decompositions   Our heuristic approach is justified by the fact that in feature extraction for many kinds of natural data, the ICA model is only a rather coarse approximation. In particular, the number of potential "independent components" seems to be infinite: the set of such components is closer to a continuous manifold than to a discrete set. One piece of evidence for this is that classic ICA estimation methods give different basis vectors when started with different initial values, and the number of components thus produced does not seem to be limited. Any classic ICA estimation method gives a rather arbitrary collection of components which are somewhat independent and have sparse marginal distributions.

We can also assume, for simplicity, that the data is prewhitened as a preprocessing step, as in most ICA methods in Part II. Then the independent components are simply given by the dot-products of the whitened data vector z with the basis vectors a_i. Due to the preceding considerations, we assume in our approach that what is usually needed is a collection of basis vectors with the following two properties:

1. The dot-products a_i^T z of the observed data with the basis vectors have sparse (supergaussian) marginal distributions.

2. The a_i^T z should be approximately uncorrelated ("quasiuncorrelated"). Equivalently, the vectors a_i should be approximately orthogonal ("quasiorthogonal").

A decomposition with these two properties seems to capture the essential properties of the decomposition obtained by estimation of the ICA model. Such decompositions could be called sparse approximately uncorrelated decompositions.

It is clear that it is possible to find highly overcomplete basis sets that have the first of these two properties. Classic ICA estimation is usually based on maximizing the sparseness (or, in general, nongaussianity) of the dot-products, so the existence of several different classic ICA decompositions for a given image data set shows the existence of decompositions with the first property. What is not obvious, however, is that it is possible to find strongly overcomplete decompositions such that the dot-products are approximately uncorrelated. The main point here is that this is possible because of the phenomenon of quasiorthogonality.

Quasiorthogonality in high-dimensional spaces   Quasiorthogonality [247] is a somewhat counterintuitive phenomenon encountered in very high-dimensional spaces: in a certain sense, there is much more room for vectors in high-dimensional spaces. The point is that in an n-dimensional space, where n is large, it is possible to have (say) 2n vectors that are practically orthogonal, i.e., whose angles are close to 90 degrees. In fact, as n grows, the angles can be made arbitrarily close to 90 degrees. This must be contrasted with low-dimensional spaces: if, for example, n = 2, even the maximally separated 2n = 4 vectors exhibit angles of 45 degrees. In image decomposition, we are usually dealing with spaces whose dimensions are of the order of 100. Therefore, we can easily find decompositions of, say, 400 basis vectors, such that the vectors are quite orthogonal, with practically all the angles between basis vectors staying above 80 degrees.
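The following small experiment illustrates the phenomenon numerically: draw 2n random unit vectors in an n-dimensional space and inspect the pairwise angles. The random directions and the chosen dimension (n = 100, matching the order of magnitude mentioned above) are assumptions for illustration; an explicit quasiorthogonalization step, as described next, pushes the angles even closer to 90 degrees.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100                                    # dimension of the space
    V = rng.normal(size=(2 * n, n))            # 2n random directions
    V /= np.linalg.norm(V, axis=1, keepdims=True)

    # Pairwise angles between distinct unit vectors, in degrees
    G = np.clip(V @ V.T, -1.0, 1.0)
    angles = np.degrees(np.arccos(G[np.triu_indices(2 * n, k=1)]))

    # Most angles concentrate near 90 degrees; in 2 dimensions this would be impossible
    print(angles.min(), angles.mean(), np.mean(angles > 80))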
FastICA with quasiorthogonalization   To obtain a quasiuncorrelated sparse decomposition as defined above, we need two things: first, a method for finding vectors a_i that have maximally sparse dot-products, and second, a method of quasiorthogonalization of such vectors. Actually, most classic ICA algorithms can be considered as maximizing the nongaussianity of the dot-products with the basis vectors, provided that the data is prewhitened (this was shown in Chapter 8). Thus the main problem here is constructing a proper method for quasidecorrelation.

We have developed two methods for quasidecorrelation: one of them is symmetric and the other one is deflationary. This dichotomy is the same as in ordinary decorrelation methods used in ICA. As above, it is assumed here that the data is whitened.

A simple way of achieving quasiorthogonalization is to modify the ordinary deflation scheme based on a Gram-Schmidt-like orthogonalization. This means that we estimate the basis vectors one by one. When we have estimated p basis vectors a_1, ..., a_p, we run the one-unit fixed-point algorithm for a_{p+1}, and after every iteration step subtract from a_{p+1} a certain proportion of the "projections" (a_{p+1}^T a_j) a_j, j = 1, ..., p, onto the previously estimated p vectors, and then renormalize a_{p+1}:

1. a_{p+1} \leftarrow a_{p+1} - \alpha \sum_{j=1}^{p} (a_{p+1}^T a_j) a_j
2. a_{p+1} \leftarrow a_{p+1} / \| a_{p+1} \|    (16.10)

where \alpha is a constant determining the force of the quasiorthogonalization. If \alpha = 1, we have ordinary, perfect orthogonalization. We have found in our experiments that an \alpha in the range [0.1, 0.3] is sufficient in spaces where the dimension is 64.

In certain applications it may be desirable to use a symmetric version of quasiorthogonalization, in which no vectors are "privileged" over others [210, 197]. This can be accomplished, for example, by the following algorithm:

1. A \leftarrow \frac{3}{2} A - \frac{1}{2} A A^T A
2. Normalize each column of A to unit norm    (16.11)

which is closely related to the iterative symmetric orthogonalization method used for basic ICA in Section 8.4.3: the present algorithm simply performs one iteration of that iterative algorithm. In some cases, it may be necessary to do two or more iterations, although in the experiments below, just one iteration was sufficient.

Thus, the algorithm that we propose is similar to the FastICA algorithm as described, e.g., in Section 8.3.5 in all respects other than the orthogonalization, which is replaced by one of the preceding quasiorthogonalization methods; a sketch of the two quasiorthogonalization steps is given below.
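The two quasiorthogonalization rules (16.10) and (16.11) are easy to state in NumPy. The function names, the default α, and the column-wise storage of the basis vectors are assumptions of this sketch, and the one-unit FastICA updates that would alternate with these steps are omitted.

    import numpy as np

    def deflationary_quasiorth(a_new, A_prev, alpha=0.2):
        """Eq. (16.10): partially decorrelate a_new from previously estimated vectors.

        A_prev holds the p already-estimated basis vectors as its columns.
        """
        if A_prev.size:
            proj = A_prev @ (A_prev.T @ a_new)       # sum_j (a_new^T a_j) a_j
            a_new = a_new - alpha * proj
        return a_new / np.linalg.norm(a_new)

    def symmetric_quasiorth(A):
        """Eq. (16.11): one step of approximate symmetric orthogonalization."""
        A = 1.5 * A - 0.5 * (A @ A.T) @ A
        return A / np.linalg.norm(A, axis=0, keepdims=True)

With alpha = 1 the deflationary rule reduces to an ordinary Gram-Schmidt step; with the values around 0.1-0.3 mentioned above, the basis vectors are only pushed towards orthogonality, which leaves room for an overcomplete set.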
Experiments with overcomplete image bases   We applied our algorithm to image windows (patches) of 8 × 8 pixels taken from natural images. Thus, we used ICA for feature extraction as explained in detail in Chapter 21. The mean of the image window (DC component) was removed as a preprocessing step, so the dimension of the data was 63. Both deflationary and symmetric quasiorthogonalization were used. The nonlinearity used in the FastICA algorithm was the hyperbolic tangent. Fig. 16.1 shows an estimated, approximately 4 times overcomplete basis (with 240 components). The sample size was 14000. The results shown here were obtained using the symmetric approach; the deflationary approach yielded similar results, with the parameter α fixed at 0.1.

The results show that the estimated basis vectors are qualitatively quite similar to those obtained by other, computationally more expensive methods [274]; they are also similar to those obtained by basic ICA (see Chapter 21). Moreover, by computing the dot-products between different basis vectors, we see that the basis is indeed quasiorthogonal. This validates our heuristic approach.

Fig. 16.1 The basis vectors of a 4 times overcomplete basis. The dimension of the data is 63 (excluding the DC component, i.e., the local mean) and the number of basis vectors is 240. The results are shown in the original space, i.e., the inverse of the preprocessing (whitening) was performed. The symmetric approach was used. The basis vectors are very similar to Gabor functions or wavelets, as is typical with image data (see Chapter 21).

16.2.4 Other approaches

We mention here some other algorithms for estimation of overcomplete bases. First, in [341], independent components with binary values were considered, and a geometrically motivated method was proposed. Second, a tensorial algorithm for the overcomplete estimation problem was proposed in [63]; related theoretical results were derived in [58]. Third, a natural gradient approach was developed in [5]. Further developments on estimation of overcomplete bases, using methods similar to the preceding quasiorthogonalization algorithm, can be found in [208].

16.3 CONCLUDING REMARKS

The ICA problem becomes much more complicated if there are more independent components than observed mixtures, and basic ICA methods cannot be used as such. In most practical applications, it may be more useful to use the basic ICA model as an approximation of the overcomplete basis model, because the estimation of the basic model can be performed with reliable and efficient algorithms.

When the basis is overcomplete, the formulation of the likelihood is difficult, since the problem belongs to the class of missing data problems. Methods based on maximum likelihood estimation are therefore computationally rather inefficient. To obtain computationally efficient algorithms, strong approximations are necessary. For example, one can use a modification of the FastICA algorithm that is based on finding a quasidecorrelating sparse decomposition. This algorithm is computationally very efficient, reducing the complexity of overcomplete basis estimation to that of classic ICA estimation.
