Independent Component Analysis, Chapter 17: Nonlinear ICA

17 Nonlinear ICA

This chapter deals with independent component analysis (ICA) for nonlinear mixing models. A fundamental difficulty of the nonlinear ICA problem is that it is highly nonunique without some extra constraints, which are often realized by using a suitable regularization. We also address the nonlinear blind source separation (BSS) problem, which, contrary to the linear case, we consider to be different from the respective nonlinear ICA problem. After considering these matters, some methods introduced for solving the nonlinear ICA or BSS problems are discussed in more detail. Special emphasis is given to a Bayesian approach that applies ensemble learning to a flexible multilayer perceptron model for finding the sources and the nonlinear mixing mapping that have most probably given rise to the observed mixed data. The efficiency of this method is demonstrated using both artificial and real-world data. At the end of the chapter, other techniques proposed for solving the nonlinear ICA and BSS problems are reviewed.

17.1 NONLINEAR ICA AND BSS

17.1.1 The nonlinear ICA and BSS problems

In many situations, the basic linear ICA or BSS model

$$\mathbf{x} = \mathbf{A}\mathbf{s} \qquad (17.1)$$

is too simple for describing the observed data adequately. Hence, it is natural to consider extensions of the linear model to nonlinear mixing models. For instantaneous mixtures, the nonlinear mixing model has the general form

$$\mathbf{x} = \mathbf{f}(\mathbf{s}) \qquad (17.2)$$

where $\mathbf{x}$ is the observed $m$-dimensional data (mixture) vector, $\mathbf{f}$ is an unknown real-valued $m$-component mixing function, and $\mathbf{s}$ is an $n$-vector whose elements are the unknown independent components. Assume now for simplicity that the number $n$ of independent components equals the number $m$ of mixtures. The general nonlinear ICA problem then consists of finding a mapping $\mathbf{g}$ that gives components

$$\mathbf{y} = \mathbf{g}(\mathbf{x}) \qquad (17.3)$$

that are statistically independent.

A fundamental characteristic of the nonlinear ICA problem is that in the general case, solutions always exist, and they are highly nonunique. One reason for this is that if $x$ and $y$ are two independent random variables, any of their functions $f(x)$ and $g(y)$ are also independent. An even more serious problem is that in the nonlinear case, $x$ and $y$ can be mixed and still be statistically independent, as will be shown below. This is not unlike the situation with gaussian independent components in a linear mixing.

In this chapter, we define BSS in a special way to clarify the distinction between finding independent components and finding the original sources. Thus, in the respective nonlinear BSS problem, one should find the original source signals that have generated the observed data. This is usually a clearly more meaningful and unique problem than nonlinear ICA as defined above, provided that suitable prior information is available on the sources and/or the mixing mapping. It is worth emphasizing that if some arbitrary independent components are found for the data generated by (17.2), they may be quite different from the true source signals. Hence the situation differs greatly from that of the basic linear data model (17.1), for which the ICA and BSS problems have the same solution. Generally, solving the nonlinear BSS problem is not easy, and it requires additional prior information or suitable regularizing constraints.
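The first kind of nonuniqueness is easy to verify numerically. The following sketch (a made-up toy setup in Python/NumPy, not from the original text) draws two independent sources, passes each through its own componentwise nonlinearity, and checks that simple cross-statistics of the results stay near zero; this is a quick sanity check rather than a formal independence test.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
s1 = rng.uniform(-1.0, 1.0, n)      # two independent sources
s2 = rng.laplace(size=n)

# Componentwise nonlinear functions of independent variables are
# still independent, so (y1, y2) is a valid "nonlinear ICA" solution
# even though neither is a scaled/permuted copy of a source.
y1 = np.tanh(3.0 * s1)
y2 = s2 ** 3

# Crude empirical check: cross-correlations of a few nonlinear
# features should all be near zero.
for a, b in [(y1, y2), (y1 ** 2, y2), (y1, y2 ** 2), (np.sin(y1), y2)]:
    print(f"corr = {np.corrcoef(a, b)[0, 1]:+.4f}")
```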
An important special case of the general nonlinear mixing model (17.2) consists of so-called post-nonlinear mixtures. There each mixture $x_i$ has the form

$$x_i = f_i\Big(\sum_{j=1}^{n} a_{ij} s_j\Big), \qquad i = 1, \dots, n \qquad (17.4)$$

Thus the sources $s_j$, $j = 1, \dots, n$, are first mixed linearly according to the basic ICA/BSS model (17.1), but after that a nonlinear function $f_i$ is applied to each mixture to get the final observations $x_i$. It can be shown [418] that for post-nonlinear mixtures, the indeterminacies are usually the same as for the basic linear instantaneous mixing model (17.1). That is, the sources can be separated or the independent components estimated up to the scaling, permutation, and sign indeterminacies under weak conditions on the mixing matrix $\mathbf{A}$ and the source distributions. The post-nonlinearity assumption is useful and reasonable in many signal processing applications, because it can be thought of as a model for a nonlinear sensor distortion. In more general situations, it is a restrictive and somewhat arbitrary constraint. This model will be treated in more detail below.

Another difficulty with the general nonlinear BSS (or ICA) methods proposed thus far is that they tend to be computationally rather demanding. Moreover, the computational load usually increases very rapidly with the dimensionality of the problem, which in practice prevents the application of nonlinear BSS methods to high-dimensional data sets.

The nonlinear BSS and ICA methods presented in the literature can be divided into two broad classes: generative approaches and signal transformation approaches [438]. In the generative approaches, the goal is to find a specific model that explains how the observations were generated. In our case, this amounts to estimating both the source signals $\mathbf{s}$ and the unknown mixing mapping $\mathbf{f}$ that have generated the observed data through the general mapping (17.2). In the signal transformation methods, one tries to estimate the sources directly using the inverse transformation (17.3). In these methods, the number of estimated sources is the same as the number of observed mixtures [438].
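As a concrete illustration of the generative viewpoint, the sketch below (a hypothetical mixing matrix and distortion, chosen only for illustration) generates post-nonlinear mixtures of the form (17.4): a linear mixing stage followed by a componentwise sensor distortion.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1000
# Two independent sources: a sinusoid and uniform white noise
s = np.vstack([np.sin(0.05 * np.arange(T)),
               rng.uniform(-1.0, 1.0, T)])

A = np.array([[1.0, 0.6],       # hypothetical mixing matrix
              [0.7, 1.0]])

z = A @ s                       # linear stage, as in the basic model (17.1)
x = np.tanh(z)                  # componentwise distortions f_i give the
                                # post-nonlinear mixtures (17.4)
```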
17.1.2 Existence and uniqueness of nonlinear ICA

The question of the existence and uniqueness of solutions to nonlinear independent component analysis has been addressed in [213]. The authors show that there always exists an infinity of solutions if the space of the nonlinear mixing functions is not limited. They also present a method for constructing parameterized families of nonlinear ICA solutions. A unique solution (up to a rotation) can be obtained in the two-dimensional special case if the mixing mapping is constrained to be a conformal mapping, together with some other assumptions; see [213] for details.

In the following, we present in more detail the constructive method introduced in [213], which always yields at least one solution to the nonlinear ICA problem. This procedure can be considered a generalization of the well-known Gram-Schmidt orthogonalization method. Given $n$ independent variables $\mathbf{x} = (x_1, \dots, x_n)$ and a variable $y$, a new variable $x_{n+1}$ is constructed so that the set $(x_1, \dots, x_{n+1})$ is mutually independent. The construction is defined recursively as follows. Assume that we already have $n$ independent random variables $x_1, \dots, x_n$ that are jointly uniformly distributed in $[0,1]^n$. Here it is not a restriction to assume that the distributions of the $x_i$ are uniform, since this follows directly from the recursion, as will be seen below; for a single variable, uniformity can be attained by the probability integral transformation; see (2.85). Denote by $y$ any random variable, and by $a_1, \dots, a_n$ some nonrandom scalars. Define

$$g(a_1, \dots, a_n, y; p) = P(Y \le y \mid x_1 = a_1, \dots, x_n = a_n) = \int_{-\infty}^{y} p(\xi \mid a_1, \dots, a_n)\, d\xi \qquad (17.5)$$

where $p(\cdot \mid \cdot)$ denotes the conditional probability density of $y$ given $x_1, \dots, x_n$, obtained from the joint and marginal probability densities of $(\mathbf{x}, y)$ and $\mathbf{x}$, respectively (it is assumed here implicitly that such densities exist). The $p$ in the argument of $g$ is there to remind us that $g$ depends on the joint probability distribution of $\mathbf{x}$ and $y$. For $n = 0$, $g$ is simply the cumulative distribution function of $y$. Now, $g$ as defined above gives a nonlinear decomposition, as stated in the following theorem.

Theorem 17.1 Assume that $x_1, \dots, x_n$ are independent scalar random variables that have a joint uniform distribution in the unit cube $[0,1]^n$. Let $y$ be any scalar random variable. Define $g$ as in (17.5), and set

$$x_{n+1} = g(x_1, \dots, x_n, y; p) \qquad (17.6)$$

Then $x_{n+1}$ is independent of the $x_i$, and the variables $x_1, \dots, x_{n+1}$ are jointly uniformly distributed in the unit cube $[0,1]^{n+1}$.

The theorem is proved in [213]. The constructive method given above can be used to decompose $n$ variables into $n$ independent components, giving a solution to the nonlinear ICA problem. This construction also clearly shows that the decomposition into independent components is by no means unique. For example, we could first apply a linear transformation to $\mathbf{x}$ to obtain another random vector $\mathbf{x}' = \mathbf{M}\mathbf{x}$, and then compute $\mathbf{y}' = \mathbf{g}'(\mathbf{x}')$, with $\mathbf{g}'$ defined by the above procedure with $\mathbf{x}$ replaced by $\mathbf{x}'$. Thus we obtain another decomposition of $\mathbf{x}$ into independent components. The resulting decomposition $\mathbf{y}' = \mathbf{g}'(\mathbf{M}\mathbf{x})$ is in general different from $\mathbf{y} = \mathbf{g}(\mathbf{x})$, and it cannot be reduced to $\mathbf{y}$ by any simple transformations. A more rigorous justification of the nonuniqueness property is given in [213].

Lin [278] has recently derived some interesting theoretical results on ICA that are useful in describing the nonuniqueness of the general nonlinear ICA problem. Let the matrices $\mathbf{D}_s$ and $\mathbf{D}_x$ denote the Hessians of the logarithmic probability densities $\log p_s(\mathbf{s})$ and $\log p_x(\mathbf{x})$ of the source vector $\mathbf{s}$ and the mixture (data) vector $\mathbf{x}$, respectively. Then for the basic linear ICA model (17.1) it holds that

$$\mathbf{D}_s = \mathbf{A}^T \mathbf{D}_x \mathbf{A} \qquad (17.7)$$

where $\mathbf{A}$ is the mixing matrix. If the components of $\mathbf{s}$ are truly independent, $\mathbf{D}_s$ should be a diagonal matrix. Due to the symmetry of the Hessian matrices $\mathbf{D}_s$ and $\mathbf{D}_x$, Eq. (17.7) imposes $n(n-1)/2$ constraints on the elements of the matrix $\mathbf{A}$. Thus a constant mixing matrix $\mathbf{A}$ can be solved for by estimating $\mathbf{D}_x$ at two different points and assuming some values for the diagonal elements of $\mathbf{D}_s$.

If the nonlinear mapping (17.2) is twice differentiable, we can approximate it locally at any point by the linear mixing model (17.1). There $\mathbf{A}$ is defined by the first-order term of the Taylor series expansion of $\mathbf{f}$ at the desired point. But now $\mathbf{A}$ generally changes from point to point, so that the constraint conditions (17.7) still leave many degrees of freedom for determining the local mixing matrix (even omitting the diagonal elements). This also shows that the nonlinear ICA problem is highly nonunique.

Taleb and Jutten have considered the separability of nonlinear mixtures in [418, 227]. Their general conclusion is the same as before: separation is impossible without additional prior knowledge of the model, since the independence assumption alone is not strong enough in the general nonlinear case.
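A sample-based version of the construction (17.5)-(17.6) is easy to sketch for $n = 1$. The code below (an illustrative approximation, assuming nothing beyond NumPy) estimates the conditional cumulative distribution of $y$ given $x_1$ by binning $x_1$ and applying the empirical probability integral transform within each bin; the resulting variable is approximately uniform and approximately independent of $x_1$.

```python
import numpy as np

def ecdf_transform(y):
    """Empirical probability integral transform: maps samples of y
    to approximately Uniform(0, 1); cf. Eq. (2.85)."""
    ranks = np.argsort(np.argsort(y))
    return (ranks + 0.5) / len(y)

def conditional_cdf_transform(y, x, n_bins=20):
    """Crude sample version of (17.5): within each bin of x, replace
    y by its within-bin empirical CDF value, i.e., an estimate of
    P(Y <= y | x)."""
    u = np.empty_like(y, dtype=float)
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    bins = np.digitize(x, edges)
    for b in np.unique(bins):
        idx = bins == b
        u[idx] = ecdf_transform(y[idx])
    return u

rng = np.random.default_rng(2)
x1 = rng.uniform(size=50_000)           # already uniform: the given x_1
y = x1 + rng.normal(size=50_000)        # y is strongly dependent on x_1
x2 = conditional_cdf_transform(y, x1)   # x_2 as in (17.6): ~uniform and
                                        # approximately independent of x_1
print(np.corrcoef(x1, x2)[0, 1])        # should be near zero
```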
17.2 SEPARATION OF POST-NONLINEAR MIXTURES

Before discussing approaches applicable to general nonlinear mixtures, let us briefly consider blind separation methods proposed for the simpler case of post-nonlinear mixtures (17.4). Especially Taleb and Jutten have developed BSS methods for this case. Their main results are presented in [418], and a short overview of their studies on this problem can be found in [227]. In the following, we present the main points of their method.

A separation method for the post-nonlinear mixtures (17.4) generally consists of two subsequent parts or stages:

1. A nonlinear stage, which should cancel the nonlinear distortions $f_i$. This part consists of componentwise nonlinear functions $g_i(x_i; \theta_i)$. The parameters $\theta_i$ of each nonlinearity are adjusted so that the cancellation is achieved (at least roughly).

2. A linear stage that separates the approximately linear mixtures $\mathbf{z} = [g_1(x_1; \theta_1), \dots, g_n(x_n; \theta_n)]^T$ obtained after the nonlinear stage. This is done as usual by learning a separating matrix $\mathbf{B}$ for which the components of the output vector $\mathbf{y} = \mathbf{B}\mathbf{z}$ of the separating system are statistically independent (or as independent as possible).

Taleb and Jutten [418] use the mutual information between the components of the output vector $\mathbf{y}$ (see Chapter 10) as the cost function and independence criterion in both stages. For the linear part, minimization of the mutual information leads to the familiar Bell-Sejnowski algorithm (see Chapters 9 and 10)

$$\Delta \mathbf{B} \propto (\mathbf{B}^T)^{-1} - \mathrm{E}\{\boldsymbol{\varphi}(\mathbf{y})\,\mathbf{z}^T\} \qquad (17.8)$$

where the components $\varphi_i$ of the vector $\boldsymbol{\varphi}$ are the score functions of the components $y_i$ of the output vector $\mathbf{y}$:

$$\varphi_i(y_i) = -\frac{d}{dy_i}\log p_i(y_i) = -\frac{p_i'(y_i)}{p_i(y_i)} \qquad (17.9)$$

Here $p_i$ is the probability density function of $y_i$ and $p_i'$ its derivative. In practice, the natural gradient algorithm is used instead of the Bell-Sejnowski algorithm (17.8); see Chapter 9.

For the nonlinear stage, one can derive the gradient learning rule [418]

$$\frac{\partial I}{\partial \theta_k} = -\mathrm{E}\left\{\frac{\partial \log |g_k'(x_k; \theta_k)|}{\partial \theta_k}\right\} - \mathrm{E}\left\{\frac{\partial g_k(x_k; \theta_k)}{\partial \theta_k}\sum_{i=1}^{n}\varphi_i(y_i)\, b_{ik}\right\}$$

Here $x_k$ is the $k$th component of the input vector, $b_{ik}$ is an element of the matrix $\mathbf{B}$, and $g_k'$ is the derivative of the $k$th nonlinear function $g_k$ with respect to its argument. The exact computational algorithm naturally depends on the specific parametric form chosen for the nonlinear mappings $g_k$. In [418], a multilayer perceptron network is used for modeling the functions $g_k$, $k = 1, \dots, n$.

In linear BSS, it suffices that the score functions (17.9) are of the right type for achieving separation. However, their appropriate estimation is critical for the good performance of the proposed nonlinear separation method. The score functions (17.9) must be estimated adaptively from the output vector $\mathbf{y}$. Several alternative ways of doing this are considered in [418]. An estimation method based on the Gram-Charlier expansion performs appropriately only for mild post-nonlinear distortions. However, another method, which estimates the score functions directly, provides very good results also for hard nonlinearities. Experimental results are presented in [418]. A well-performing batch-type method for estimating the score functions has been introduced in a later paper [417]. Before proceeding, we mention that the separation of post-nonlinear mixtures has also been studied in [271, 267, 469], using mainly extensions of the natural gradient algorithm.
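For the linear stage, the natural gradient version of (17.8) takes the familiar form $\Delta\mathbf{B} \propto (\mathbf{I} - \mathrm{E}\{\boldsymbol{\varphi}(\mathbf{y})\mathbf{y}^T\})\mathbf{B}$. A minimal sketch of one such update is given below; it uses tanh as a fixed surrogate score function, whereas the actual method of [418] estimates the score functions (17.9) adaptively from the outputs.

```python
import numpy as np

def natural_gradient_step(B, z, mu=0.01):
    """One natural-gradient update of the separating matrix B for
    outputs y = B z, where z holds the outputs of the nonlinear
    stage (one column per sample)."""
    y = B @ z
    phi = np.tanh(y)                      # surrogate for the scores (17.9)
    C = (phi @ y.T) / z.shape[1]          # sample estimate of E{phi(y) y^T}
    return B + mu * (np.eye(B.shape[0]) - C) @ B
```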
17.3 NONLINEAR BSS USING SELF-ORGANIZING MAPS

One of the earliest ideas for achieving nonlinear BSS (or ICA) is to use Kohonen's self-organizing map (SOM) to that end. This method was originally introduced by Pajunen et al. [345]. The SOM [247, 172] is a well-known mapping and visualization method that learns, in an unsupervised manner, a nonlinear mapping from the data to a usually two-dimensional grid. The learned mapping from the often high-dimensional data space to the grid tries to preserve the structure of the data as well as possible. Another goal of the SOM method is to map the data so that they are uniformly distributed on the rectangular (or hexagonal) grid. This can be roughly achieved with suitable choices [345]. If the joint probability density of two random variables is uniformly distributed inside a rectangle, then clearly the marginal densities along the sides of the rectangle are statistically independent. This observation justifies applying the self-organizing map to nonlinear BSS or ICA. The SOM mapping provides the regularization needed in nonlinear BSS, because it tries to preserve the structure of the data. This implies that the mapping should be as simple as possible while achieving the desired goals.

The following experiment [345] illustrates the use of the self-organizing map in nonlinear blind source separation. There were two subgaussian source signals, shown in Fig. 17.1, consisting of a sinusoid and uniformly distributed white noise. Each source vector was first mixed linearly using the mixing matrix $\mathbf{A}$ of Eq. (17.10). After this, the data vectors were obtained as post-nonlinear mixtures of the sources by applying formula (17.4) with fixed componentwise nonlinearities $f_i$. These mixtures are depicted in Fig. 17.2.

Fig. 17.1 Original source signals.

Fig. 17.2 Nonlinear mixtures.

The sources separated by the SOM method are shown in Fig. 17.3, and the converged SOM map is illustrated in Fig. 17.4. The estimates of the source signals in Fig. 17.3 are obtained by mapping each data vector onto the map of Fig. 17.4 and reading the coordinates of the mapped data vector.

Fig. 17.3 Signals separated by the SOM.

Fig. 17.4 Converged SOM map.

Even though the preceding experiment was carried out with post-nonlinear mixtures, the use of the SOM method is not limited to them. Generally speaking, however, there are several difficulties in applying self-organizing maps to nonlinear blind source separation. If the sources are uniformly distributed, then it can be heuristically justified that the regularization of the nonlinear separating mapping provided by the SOM approximately separates the sources. But if the true sources are not uniformly distributed, the separating mapping providing uniform densities inevitably causes distortions, which are in general the more serious the farther the true source densities are from uniform ones. Of course, the SOM method still provides an approximate solution to the nonlinear ICA problem, but this solution may have little to do with the true source signals. Another difficulty in using the SOM for nonlinear BSS or ICA is that the computational complexity increases very rapidly with the number of sources (the dimensionality of the map), limiting the potential applications of this method to small-scale problems. Furthermore, the mapping provided by the SOM is discrete, with the discretization determined by the number of grid points.
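A rough reimplementation of the SOM-based separation is sketched below. It assumes the third-party minisom package (any SOM implementation would do) and, as in the experiment of [345], uses the grid coordinates of each sample's best-matching unit as the source estimates.

```python
import numpy as np
from minisom import MiniSom   # third-party package, assumed installed

def som_separate(x, grid=20, iters=10_000):
    """x: (T, 2) array of nonlinear mixtures; returns a (T, 2) array
    of source estimates scaled to [0, 1]."""
    som = MiniSom(grid, grid, x.shape[1], sigma=2.0,
                  learning_rate=0.5, random_seed=0)
    som.train_random(x, iters)
    # Source estimates = grid coordinates of each sample's
    # best-matching unit on the converged map.
    coords = np.array([som.winner(v) for v in x], dtype=float)
    return coords / (grid - 1)
```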
17.4 A GENERATIVE TOPOGRAPHIC MAPPING APPROACH TO NONLINEAR BSS *

17.4.1 Background

The self-organizing map discussed briefly in the previous section is a nonlinear mapping method inspired by neurobiological modeling arguments. Bishop, Svensen, and Williams introduced the generative topographic mapping (GTM) as a statistically more principled alternative to the SOM. Their method is presented in detail in [49].

In the basic GTM method, mutually similar impulse (delta) functions that are equispaced on a rectangular grid are used to model the discrete uniform density in the space of latent variables, or the joint density of the sources in our case. The mapping from the sources to the observed data, corresponding in our nonlinear BSS problem to the nonlinear mixing mapping (17.2), is modeled using a mixture-of-gaussians model. The parameters of the mixture-of-gaussians model, which define the mixing mapping, are then estimated using a maximum likelihood (ML) method (see Section 4.5), realized by the expectation-maximization (EM) algorithm [48, 172]. After this, the inverse (separating) mapping from the data to the latent variables (sources) can be determined.

It is well known that any sufficiently smooth continuous mapping can be approximated with arbitrary accuracy using a mixture-of-gaussians model with sufficiently many gaussian basis functions [172, 48]. Roughly stated, this provides the theoretical basis of the GTM method. A fundamental difference between GTM and the SOM is that GTM is based on a generative approach that starts by assuming a model for the latent variables, in our case the sources. The SOM, on the other hand, tries to separate the sources directly by starting from the data and constructing a suitable separating signal transformation. A key benefit of GTM is its firm theoretical foundation, which helps to overcome some of the limitations of the SOM. It also provides the basis for generalizing the GTM approach to arbitrary source densities.

Using the basic GTM method instead of the SOM for nonlinear blind source separation does not yet bring any notable improvement, because the densities of the sources are still assumed to be uniform. However, it is straightforward to generalize the GTM method to arbitrary known source densities. The advantage of this approach is that one can directly regularize the inverse of the mixing mapping by using the known source densities. This modified GTM method is then used for finding a noncomplex mixing mapping. The approach is described in the following.

17.4.2 The modified GTM method

The modified GTM method introduced in [346] differs from the standard GTM [49] only in that the required joint density of the latent variables (sources) is defined as a weighted sum of delta functions instead of plain delta functions. The weighting coefficients are determined by discretizing the known source densities. Only the main points of the GTM method are presented here, with emphasis on the modifications made for applying it to nonlinear blind source separation. Readers wishing to gain a deeper understanding of the GTM method should consult the original paper [49].

The GTM method closely resembles the SOM in that it uses a discrete grid of $K$ points forming a regular array in the $n$-dimensional latent space. As in the SOM, the dimension $n$ of the latent space is usually 2. Vectors lying in the latent space are denoted by $\mathbf{s}$; in our application they will be source vectors. The GTM method uses a set of fixed nonlinear basis functions $\phi_j$, $j = 1, \dots, M$, which form a nonorthogonal basis set. These basis functions typically consist of a regular array of spherical gaussian functions, but the basis functions can, at least in principle, be of other types as well.
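To fix the notation, the following sketch (arbitrary grid sizes and basis widths, chosen only for illustration) builds a regular grid of $K$ latent nodes and the $K \times M$ matrix of spherical gaussian basis-function activations used in the equations below.

```python
import numpy as np

def latent_grid(k_per_dim=10, dim=2):
    """k_per_dim**dim node locations on a regular grid in [-1, 1]^dim."""
    axes = [np.linspace(-1.0, 1.0, k_per_dim)] * dim
    return np.stack(np.meshgrid(*axes), axis=-1).reshape(-1, dim)

def basis_matrix(nodes, m_per_dim=4, width=0.5):
    """Activations of a regular array of spherical gaussian basis
    functions at the latent nodes; the result is K x M."""
    centers = latent_grid(m_per_dim, nodes.shape[1])
    d2 = ((nodes[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

nodes = latent_grid()        # K x 2 node locations s_i in latent space
Phi = basis_matrix(nodes)    # K x M basis activations phi(s_i)
```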
The mapping from the $n$-dimensional latent space to the $m$-dimensional data space, which in our case is the mixing mapping of Eq. (17.2), is modeled in GTM as a linear combination of the basis functions $\phi_j$:

$$\mathbf{f}(\mathbf{s}) = \mathbf{W}\boldsymbol{\phi}(\mathbf{s}), \qquad \boldsymbol{\phi}(\mathbf{s}) = [\phi_1(\mathbf{s}), \dots, \phi_M(\mathbf{s})]^T \qquad (17.11)$$

Here $\mathbf{W}$ is an $m \times M$ matrix of weight parameters. Denote the node locations in the latent space by $\mathbf{s}_i$, $i = 1, \dots, K$. Eq. (17.11) then defines a corresponding set of reference vectors

$$\mathbf{m}_i = \mathbf{W}\boldsymbol{\phi}(\mathbf{s}_i) \qquad (17.12)$$

in the data space. Each of these reference vectors forms the center of an isotropic gaussian distribution in the data space. Denoting the common inverse variance of these gaussians by $\beta$, we get

$$p(\mathbf{x} \mid \mathbf{s}_i, \mathbf{W}, \beta) = \left(\frac{\beta}{2\pi}\right)^{m/2} \exp\left(-\frac{\beta}{2}\,\|\mathbf{x} - \mathbf{m}_i\|^2\right) \qquad (17.13)$$

The probability density function of the GTM model is obtained by summing over all the gaussian components, yielding

$$p(\mathbf{x} \mid \mathbf{W}, \beta) = \frac{1}{K}\sum_{i=1}^{K} p(\mathbf{x} \mid \mathbf{s}_i, \mathbf{W}, \beta) \qquad (17.14)$$

Here $K$ is the total number of gaussian components, which is equal to the number of grid points in the latent space, and the prior probabilities of the gaussian components are all equal to $1/K$.

GTM tries to represent the distribution of the observed data in the $m$-dimensional data space in terms of a smaller, $n$-dimensional nonlinear manifold [49]. The gaussian distribution in (17.13) represents a noise or error model, which is needed because the data usually do not lie exactly in such a lower-dimensional manifold. It is important to realize that the gaussian distributions defined in (17.13) have nothing to do with the basis functions $\phi_j$, $j = 1, \dots, M$. It is usually advisable that the number $M$ of basis functions be clearly smaller than the number $K$ of node locations and their respective noise distributions (17.13). In this way, one can avoid overfitting and prevent the mixing mapping (17.11) from becoming overly complicated.

The unknown parameters in this model are the weight matrix $\mathbf{W}$ and the inverse variance $\beta$. These parameters are estimated by fitting the model (17.14) to the observed data vectors using the maximum likelihood method discussed earlier in Section 4.5. The log-likelihood function of the observed data is given by

$$\mathcal{L}(\mathbf{W}, \beta) = \sum_{t=1}^{T} \ln p(\mathbf{x}(t) \mid \mathbf{W}, \beta) \qquad (17.15)$$

where $1/\beta$ is the variance of $\mathbf{x}$ given $\mathbf{s}_i$ and $\mathbf{W}$, and $T$ is the total number of data vectors $\mathbf{x}(t)$.

For applying the modified GTM method, the probability density function of the source vectors should be known. Assuming that the sources are statistically independent, this joint density can be evaluated as the product of the marginal densities of the individual sources:

$$p(\mathbf{s}) = \prod_{i=1}^{n} p_i(s_i) \qquad (17.16)$$

Each marginal density $p_i(s_i)$ is here a discrete density defined at the sampling points corresponding to the locations of the node vectors.

The latent space in the GTM method usually has a small dimension, typically $n = 2$. The method can in principle be applied for larger $n$, but its computational load then increases quite rapidly, just as in the SOM method. For this reason, only [...]
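Continuing the sketch started above, the function below evaluates the reference vectors (17.12), the component densities (17.13), and the data log-likelihood (17.15). The component priors are passed in explicitly: uniform $1/K$ gives the basic GTM mixture (17.14), while weights obtained by discretizing a known source density (17.16) follow the spirit of the modified method's weighted delta functions. The weight matrix, $\beta$, and the data here are made up purely for illustration.

```python
def gtm_log_likelihood(X, W, beta, Phi, weights):
    """Log-likelihood (17.15) of data X (T x m) under the gaussian
    mixture (17.14), with component priors `weights` (uniform 1/K in
    basic GTM, discretized source densities in the modified method)."""
    m = X.shape[1]
    M_ref = Phi @ W.T                                   # K x m refs (17.12)
    d2 = ((X[:, None, :] - M_ref[None, :, :]) ** 2).sum(-1)       # T x K
    log_comp = 0.5 * m * np.log(beta / (2.0 * np.pi)) \
               - 0.5 * beta * d2                        # log of (17.13)
    # log p(x(t)) = log sum_i w_i exp(log_comp), computed stably
    # via the log-sum-exp trick.
    a = log_comp + np.log(weights)[None, :]
    amax = a.max(axis=1, keepdims=True)
    return float((amax[:, 0] + np.log(np.exp(a - amax).sum(axis=1))).sum())

# Illustration with made-up parameters:
rng = np.random.default_rng(3)
W = rng.normal(scale=0.1, size=(2, Phi.shape[1]))    # m x M weight matrix
X = rng.normal(size=(100, 2))                        # pretend observations
w_uniform = np.full(len(nodes), 1.0 / len(nodes))    # basic GTM priors
print(gtm_log_likelihood(X, W, 2.0, Phi, w_uniform))
```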
[...] nonlinear factor analysis. It is obvious that the data are quite nonlinear, because nonlinear factor analysis is able to explain the data with 10 components equally as well as linear factor analysis (PCA) does with 21 components. Different numbers of hidden neurons and sources were tested with random initializations, using a gaussian model for the sources (nonlinear factor analysis). It turned out that the Kullback-Leibler [...]

[...] [259, 438]. Nonlinear independent component analysis and blind source separation are generally difficult problems, both computationally and conceptually. Therefore, local linear ICA/BSS methods have recently received some attention as a practical compromise between linear ICA and completely nonlinear ICA or BSS. These methods are more general than standard linear ICA in that several different linear ICA models are used to describe the observed data. The local linear ICA models can be either overlapping, as in the mixture-of-ICA methods introduced in [273], or nonoverlapping, as in the clustering-based methods proposed in [234, 349]; a rough sketch of the clustering-based idea is given at the end of the chapter.

17.7 CONCLUDING REMARKS

In this chapter, generalizations of the standard linear independent component analysis (ICA) and blind source separation (BSS) problems to nonlinear data models have been considered [...]

[...] mapping which yields independent components [345]. Lin, Grier, and Cowan [279] have independently proposed using the SOM for nonlinear ICA in a different manner, by treating ICA as a computational geometry problem. The ensemble learning approach to nonlinear ICA, discussed in more detail earlier in this chapter, is based on using multilayer perceptron networks as a flexible model for the nonlinear mixing mapping [...]

[...] FastICA algorithm alone. Linear ICA is able to retrieve the original sources with only a 0.7 dB signal-to-noise ratio (SNR). In practice, a linear method could not deduce the number of sources, and the result would be even worse. The poor signal-to-noise ratio shows that the data really lie in a nonlinear subspace. Figure 17.9 depicts the results after 2000 sweeps with gaussian sources (nonlinear factor analysis) [...]

[...] method (see Chapter 9) for nonlinear BSS, presenting also an extension based on entropy maximization and experiments with post-nonlinear mixtures. Xu has developed the general Bayesian Ying-Yang framework, which can be applied to ICA as well; see, e.g., [462, 463]. Other general approaches proposed for solving the nonlinear ICA or BSS problems include a pattern repulsion based method [295] and a state-space modeling [...]

Fig. 17.8 Original sources are on the x-axis of each scatter plot, and the sources estimated by linear ICA are on the y-axis. The signal-to-noise ratio is 0.7 dB.

17.5.5 Experimental results

In all the simulations, the total number of sweeps was 7500, where one sweep means going through all the observations x(t) once. As explained before, a nonlinear factor analysis (or nonlinear PCA subspace) representation [...]

[...] generated the observed data. This regularization principle has a firm theoretical foundation, and it is intuitively satisfying for the nonlinear source separation problem. The results are encouraging for both artificial and real-world data. The ensemble learning method allows nonlinear source separation for larger-scale problems than some previously proposed, computationally quite demanding [...]

Fig. 17.11 The remaining energy of the process data as a function of the number of extracted components, using linear and nonlinear factor analysis.

[...] which was used for generating the data appears on the x-axis, and the respective estimated source on the y-axis of [...]

[...] other methods proposed for nonlinear independent component analysis or blind source separation are briefly reviewed. The interested reader can find more information on them in the given references. Already in 1987, Jutten [226] used soft nonlinear mixtures for assessing the robustness and performance of the seminal Hérault-Jutten algorithm introduced for the linear BSS problem (see Chapter 12). However, Burel [...]
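As a closing illustration of the nonoverlapping, clustering-based flavor of local linear ICA mentioned above, the sketch below (assuming scikit-learn; the cluster and component counts are arbitrary) partitions the data with k-means and fits an ordinary linear ICA model within each cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import FastICA

def local_linear_ica(x, n_clusters=4, n_components=2):
    """Nonoverlapping local linear ICA in the spirit of the
    clustering-based methods cited above: partition the data and
    fit a separate linear ICA model inside each cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(x)
    sources = np.zeros((x.shape[0], n_components))
    for c in range(n_clusters):
        idx = labels == c
        ica = FastICA(n_components=n_components, random_state=0)
        sources[idx] = ica.fit_transform(x[idx])
    return sources, labels
```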
ISBNs: 0-4 7 1-4 0540-X (Hardback); 0-4 7 1-2 213 1-7 (Electronic) 316 NONLINEAR ICA mixtures, the nonlinear mixing model has the general

Ngày đăng: 20/01/2014, 11:20

Từ khóa liên quan

Tài liệu cùng người dùng

Tài liệu liên quan