17 Nonlinear ICA
This chapter deals with independent component analysis (ICA) for nonlinear mixing models. A fundamental difficulty in the nonlinear ICA problem is that it is highly nonunique without some extra constraints, which are often realized by using a suitable regularization. We also address the nonlinear blind source separation (BSS) problem. Contrary to the linear case, we consider it as distinct from the respective nonlinear ICA problem. After considering these matters, some methods introduced for solving the nonlinear ICA or BSS problems are discussed in more detail. Special emphasis is given to a Bayesian approach that applies ensemble learning to a flexible multilayer perceptron model for finding the sources and the nonlinear mixing mapping that have most probably given rise to the observed mixed data. The efficiency of this method is demonstrated using both artificial and real-world data. At the end of the chapter, other techniques proposed for solving the nonlinear ICA and BSS problems are reviewed.
17.1 NONLINEAR ICA AND BSS
17.1.1 The nonlinear ICA and BSS problems
In many situations, the basic linear ICA or BSS model

$\mathbf{x} = \mathbf{A}\mathbf{s}$   (17.1)

is too simple for describing the observed data adequately. Hence, it is natural to consider extension of the linear model to nonlinear mixing models. For instantaneous
mixtures, the nonlinear mixing model has the general form

$\mathbf{x} = \mathbf{f}(\mathbf{s})$   (17.2)

where $\mathbf{x}$ is the observed $n$-dimensional data (mixture) vector, $\mathbf{f}$ is an unknown real-valued $n$-component mixing function, and $\mathbf{s}$ is an $m$-vector whose elements are the unknown independent components.
Assume now for simplicity that the number of independent components $m$ equals the number of mixtures $n$. The general nonlinear ICA problem then consists of finding a mapping $\mathbf{g}: \mathbb{R}^n \rightarrow \mathbb{R}^n$ that gives components

$\mathbf{y} = \mathbf{g}(\mathbf{x})$   (17.3)
that are statistically independent. A fundamental characteristic of the nonlinear ICA problem is that in the general case, solutions always exist, and they are highly nonunique. One reason for this is that if $x$ and $y$ are two independent random variables, any of their functions $f(x)$ and $g(y)$ are also independent. An even more serious problem is that in the nonlinear case, $x$ and $y$ can be mixed and still be statistically independent, as will be shown below. This is not unlike the case of gaussian independent components in a linear mixing.
In this chapter, we define BSS in a special way to clarify the distinction between
finding independent components, and finding the original sources. Thus, in the
respective nonlinear BSS problem, one should find the original source signals
that have generated the observed data. This is usually a clearly more meaningful
and unique problem than nonlinear ICA defined above, provided that suitable prior
information is available on the sources and/or the mixing mapping. It is worth
emphasizing that if some arbitrary independent components are found for the data
generated by (17.2), they may be quite different from the true source signals. Hence
the situation differs greatly from the basic linear data model (17.1), for which the
ICA or BSS problems have the same solution. Generally, solving the nonlinear BSS
problem is not easy, and requires additional prior information or suitable regularizing
constraints.
An important special case of the general nonlinear mixing model (17.2) consists of so-called post-nonlinear mixtures. There each mixture $x_i$ has the form

$x_i = f_i\Big(\sum_{j=1}^{n} a_{ij} s_j\Big), \quad i = 1, \ldots, n$   (17.4)

Thus the sources $s_j$, $j = 1, \ldots, n$, are first mixed linearly according to the basic ICA/BSS model (17.1), but after that a nonlinear function $f_i$ is applied to them to get the final observations $x_i$. It can be shown [418] that for the post-nonlinear mixtures, the indeterminacies are usually the same as for the basic linear instantaneous mixing model (17.1). That is, the sources can be separated or the independent components estimated up to the scaling, permutation, and sign indeterminacies under weak conditions on the mixing matrix $\mathbf{A}$ and the source distributions. The post-nonlinearity assumption is useful and reasonable in many signal processing applications, because
it can be thought of as a model for a nonlinear sensor distortion. In more general
situations, it is a restrictive and somewhat arbitrary constraint. This model will be
treated in more detail below.
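As a concrete illustration of the data model, the following sketch generates post-nonlinear mixtures according to (17.4); the mixing matrix and the tanh sensor distortion are hypothetical choices of ours, not values used in the book:

```python
# Generating toy data from the post-nonlinear model (17.4).
import numpy as np

rng = np.random.default_rng(1)
T = 1000
s = np.vstack([np.sin(0.05 * np.arange(T)),      # deterministic sinusoid source
               rng.uniform(-1, 1, T)])           # white uniform noise source

A = np.array([[1.0, 0.6],                        # hypothetical mixing matrix
              [0.7, 1.0]])

def f(u):
    """Componentwise invertible sensor distortion (here tanh)."""
    return np.tanh(u)

x = f(A @ s)        # x_i = f_i(sum_j a_ij s_j), cf. Eq. (17.4)
```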
Another difficulty in the general nonlinear BSS (or ICA) methods proposed thus
far is that they tend to be computationally rather demanding. Moreover, the compu-
tational load usually increases very rapidly with the dimensionality of the problem,
preventing in practice the application of nonlinear BSS methods to high-dimensional
data sets.
The nonlinear BSS and ICA methods presented in the literature could be divided
into two broad classes: generative approaches and signal transformation approaches
[438]. In the generative approaches, the goal is to find a specific model that explains
how the observations were generated. In our case, this amounts to estimating both
the source signals $\mathbf{s}(t)$ and the unknown mixing mapping $\mathbf{f}$ that have generated the observed data $\mathbf{x}(t)$ through the general mapping (17.2). In the signal transformation
methods, one tries to estimate the sources directly using the inverse transformation
(17.3). In these methods, the number of estimated sources is the same as the number
of observed mixtures [438].
17.1.2 Existence and uniqueness of nonlinear ICA
The question of existence and uniqueness of solutions for nonlinear independent
component analysis has been addressed in [213]. The authors show that there always
exists an infinity of solutions if the space of the nonlinear mixing functions $\mathbf{f}$ is not limited. They also present a method for constructing parameterized families
of nonlinearICA solutions. A unique solution (up to a rotation) can be obtained
in the two-dimensional special case if the mixing mapping is constrained to be a
conformal mapping together with some other assumptions; see [213] for details.
In the following, we present in more detail the constructive method introduced in [213] that always yields at least one solution to the nonlinear ICA problem. This procedure might be considered as a generalization of the well-known Gram-Schmidt orthogonalization method. Given $n$ independent variables $\mathbf{x} = (x_1, \ldots, x_n)$ and a variable $y$, a new variable $x_{n+1} = g(\mathbf{x}, y)$ is constructed so that the set $\{x_1, \ldots, x_{n+1}\}$ is mutually independent.

The construction is defined recursively as follows. Assume that we have already $n$ independent random variables $x_1, \ldots, x_n$ which are jointly uniformly distributed in $[0,1]^n$. Here it is not a restriction to assume that the distributions of the $x_i$ are uniform, since this follows directly from the recursion, as will be seen below; for a single variable, uniformity can be attained by the probability integral transformation; see (2.85). Denote by $y$ any random variable, and by $\xi_1, \ldots, \xi_n, \eta$ some nonrandom scalars. Define

$g(\xi_1, \ldots, \xi_n, \eta; F) = P(y \le \eta \mid x_1 = \xi_1, \ldots, x_n = \xi_n) = \dfrac{\int_{-\infty}^{\eta} p_{\mathbf{x},y}(\xi_1, \ldots, \xi_n, t)\, dt}{p_{\mathbf{x}}(\xi_1, \ldots, \xi_n)}$   (17.5)

where $p_{\mathbf{x}}$ and $p_{\mathbf{x},y}$ are the marginal probability densities of $\mathbf{x}$ and $(\mathbf{x}, y)$, respectively (it is assumed here implicitly that such densities exist), and $P(\cdot \mid \cdot)$ denotes the conditional probability. The $F$ in the argument of $g$ is to remind that $g$ depends on the joint probability distribution $F$ of $\mathbf{x}$ and $y$. For $n = 0$, $g$ is simply the cumulative distribution function of $y$. Now, $g$ as defined above gives a nonlinear decomposition, as stated in the following theorem.
Theorem 17.1 Assume that $x_1, \ldots, x_n$ are independent scalar random variables that have a joint uniform distribution in the unit cube $[0,1]^n$. Let $y$ be any scalar random variable. Define $g$ as in (17.5), and set

$x_{n+1} = g(x_1, \ldots, x_n, y; F)$   (17.6)

Then $x_{n+1}$ is independent from the $x_i$, $i = 1, \ldots, n$, and the variables $x_1, \ldots, x_{n+1}$ are jointly uniformly distributed in the unit cube $[0,1]^{n+1}$.
The theorem is proved in [213]. The constructive method given above can be used to decompose $n$ variables $y_1, \ldots, y_n$ into independent components $x_1, \ldots, x_n$, giving a solution for the nonlinear ICA problem.
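The construction can be approximated empirically. The sketch below (our own; the rank-based estimators and the bin count are illustrative choices) applies the idea of (17.5) to two dependent variables: ranks implement the probability integral transformation, and ranking within bins of $x_1$ approximates the conditional cumulative distribution function:

```python
# Empirical sketch of the Gram-Schmidt-like construction for two variables.
import numpy as np

def empirical_cdf(v):
    """Map samples to (approximately) Uniform(0,1) via their ranks."""
    ranks = np.argsort(np.argsort(v))
    return (ranks + 0.5) / len(v)

def conditional_cdf(x1, y2, n_bins=50):
    """Approximate g(x1, y2) = P(y <= y2 | x1) by ranking y2 within x1-bins."""
    out = np.empty_like(y2, dtype=float)
    bins = np.clip((x1 * n_bins).astype(int), 0, n_bins - 1)
    for b in range(n_bins):
        idx = np.where(bins == b)[0]
        if idx.size:
            out[idx] = empirical_cdf(y2[idx])
    return out

rng = np.random.default_rng(2)
T = 100_000
s = rng.standard_normal((2, T))
y = np.tanh(np.array([[1.0, 0.5], [0.4, 1.0]]) @ s)   # some dependent pair

x1 = empirical_cdf(y[0])          # uniform by construction
x2 = conditional_cdf(x1, y[1])    # uniform and approximately independent of x1
print(np.corrcoef(x1, x2)[0, 1])  # close to 0
```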
This construction also clearly shows that the decomposition into independent components is by no means unique. For example, we could first apply a linear transformation on the $y_i$ to obtain another random vector $\bar{\mathbf{y}} = \mathbf{M}\mathbf{y}$, and then compute $\bar{\mathbf{x}} = \bar{\mathbf{g}}(\bar{\mathbf{y}})$ with $\bar{\mathbf{g}}$ being defined using the above procedure, where $\mathbf{y}$ is replaced by $\bar{\mathbf{y}}$. Thus we obtain another decomposition of $\mathbf{y}$ into independent components. The resulting decomposition $\bar{\mathbf{x}} = \bar{\mathbf{g}}(\mathbf{M}\mathbf{y})$ is in general different from $\mathbf{x} = \mathbf{g}(\mathbf{y})$, and cannot be reduced to it by any simple transformations. A more rigorous justification of the nonuniqueness property has been given in [213].
Lin [278] has recently derived some interesting theoretical results on ICA that are useful in describing the nonuniqueness of the general nonlinear ICA problem. Let the matrices $\mathbf{H}_s(\mathbf{s})$ and $\mathbf{H}_x(\mathbf{x})$ denote the Hessians of the logarithmic probability densities $\ln p_s(\mathbf{s})$ and $\ln p_x(\mathbf{x})$ of the source vector $\mathbf{s}$ and the mixture (data) vector $\mathbf{x}$, respectively. Then for the basic linear ICA model (17.1) it holds that

$\mathbf{H}_x(\mathbf{x}) = \mathbf{A}^{-T} \mathbf{H}_s(\mathbf{s}) \mathbf{A}^{-1}$   (17.7)

where $\mathbf{A}$ is the mixing matrix. If the components of $\mathbf{s}$ are truly independent, $\mathbf{H}_s$ should be a diagonal matrix. Due to the symmetry of the Hessian matrices $\mathbf{H}_x$ and $\mathbf{H}_s$, Eq. (17.7) imposes $n(n+1)/2$ constraints on the $n^2$ elements of the matrix $\mathbf{A}$. Thus a constant mixing matrix $\mathbf{A}$ can be solved by estimating $\mathbf{H}_x$ at two different points, and assuming some values for the diagonal elements of $\mathbf{H}_s$.
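Relation (17.7) is straightforward to verify numerically. The sketch below (our own check, using independent logistic sources and a finite-difference Hessian) confirms it at a single point; for truly independent sources, $\mathbf{H}_s$ computed this way is indeed diagonal:

```python
# Numerical sanity check of H_x(x) = A^{-T} H_s(s) A^{-1}, Eq. (17.7).
import numpy as np

def log_ps(s):
    # independent standard logistic sources: log p(s_i) = -s_i - 2 log(1 + e^{-s_i})
    return np.sum(-s - 2 * np.log1p(np.exp(-s)))

def hessian(f, p, eps=1e-4):
    """Finite-difference Hessian of a scalar function f at point p."""
    n = len(p)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            H[i, j] = (f(p + e_i + e_j) - f(p + e_i - e_j)
                       - f(p - e_i + e_j) + f(p - e_i - e_j)) / (4 * eps**2)
    return H

A = np.array([[2.0, 1.0], [1.0, 1.5]])
A_inv = np.linalg.inv(A)
# density of x = A s: log p_x(x) = log p_s(A^{-1} x) - log |det A|
log_px = lambda x: log_ps(A_inv @ x) - np.log(abs(np.linalg.det(A)))

x0 = np.array([0.3, -0.7])
H_x = hessian(log_px, x0)
H_s = hessian(log_ps, A_inv @ x0)        # diagonal: sources are independent
print(np.allclose(H_x, A_inv.T @ H_s @ A_inv, atol=1e-5))   # True
```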
If the nonlinear mapping $\mathbf{f}$ in (17.2) is twice differentiable, we can approximate it locally at any point by the linear mixing model (17.1). There $\mathbf{A}$ is defined by the first-order term of the Taylor series expansion of $\mathbf{f}$ at the desired point. But now $\mathbf{A}$ generally changes from point to point, so that the constraint conditions (17.7) still leave $n^2 - n(n+1)/2 = n(n-1)/2$ degrees of freedom for determining the mixing matrix $\mathbf{A}$ at each point (omitting the diagonal elements). This also shows that the nonlinear ICA problem is highly nonunique.
Taleb and Jutten have considered separability of nonlinear mixtures in [418, 227].
Their general conclusion is the same as earlier: Separation is impossible without
additional prior knowledge on the model, since the independence assumption alone
is not strong enough in the general nonlinear case.
17.2 SEPARATION OF POST-NONLINEAR MIXTURES
Before discussing approaches applicable to general nonlinear mixtures, let us briefly
consider blind separation methods proposed for the simpler case of post-nonlinear
mixtures (17.4). Especially Taleb and Jutten have developed BSS methods for this
case. Their main results are presented in [418], and a short overview of their studies on this problem can be found in [227]. In the following, we present the main points of their method.
A separation method for the post-nonlinear mixtures (17.4) should generally consist of two subsequent parts or stages:

1. A nonlinear stage, which should cancel the nonlinear distortions $f_i$. This part consists of nonlinear functions $g_i(\cdot\,; \boldsymbol{\theta}_i)$. The parameters $\boldsymbol{\theta}_i$ of each nonlinearity $g_i$ are adjusted so that cancellation is achieved (at least roughly).

2. A linear stage that separates the approximately linear mixtures $\mathbf{v}$ obtained after the nonlinear stage. This is done as usual by learning a separating matrix $\mathbf{B}$ for which the components of the output vector $\mathbf{y} = \mathbf{B}\mathbf{v}$ of the separating system are statistically independent (or as independent as possible).
Taleb and Jutten [418] use the mutual information $I(\mathbf{y})$ between the components $y_i$ of the output vector $\mathbf{y}$ (see Chapter 10) as the cost function and independence criterion in both stages. For the linear part, minimization of the mutual information leads to the familiar Bell-Sejnowski algorithm (see Chapters 10 and 9)

$\Delta \mathbf{B} \propto (\mathbf{B}^T)^{-1} - \mathrm{E}\{\boldsymbol{\psi}(\mathbf{y})\mathbf{v}^T\}$   (17.8)

where the components $\psi_i(y_i)$ of the vector $\boldsymbol{\psi}(\mathbf{y})$ are score functions of the components $y_i$ of the output vector $\mathbf{y}$:

$\psi_i(y_i) = -\frac{d}{dy_i} \ln p_i(y_i) = -\frac{p_i'(y_i)}{p_i(y_i)}$   (17.9)
Here $p_i$ is the probability density function of $y_i$ and $p_i'$ its derivative. In practice, the natural gradient algorithm is used instead of the Bell-Sejnowski algorithm (17.8); see Chapter 9.
For the nonlinear stage, one can derive the gradient learning rule [418]

$\dfrac{\partial I(\mathbf{y})}{\partial \boldsymbol{\theta}_k} = -\mathrm{E}\left\{\dfrac{1}{g_k'(v_k; \boldsymbol{\theta}_k)} \dfrac{\partial g_k'(v_k; \boldsymbol{\theta}_k)}{\partial \boldsymbol{\theta}_k}\right\} - \mathrm{E}\left\{\sum_{i=1}^{n} \psi_i(y_i)\, b_{ik}\, \dfrac{\partial g_k(v_k; \boldsymbol{\theta}_k)}{\partial \boldsymbol{\theta}_k}\right\}$

Here $v_k$ is the $k$th component of the input vector $\mathbf{v}$, $b_{ik}$ is the $ik$th element of the matrix $\mathbf{B}$, and $g_k'$ is the derivative of the $k$th nonlinear function $g_k$. The exact computation algorithm depends naturally on the specific parametric form of the chosen nonlinear mapping $g_k(v_k; \boldsymbol{\theta}_k)$. In [418], a multilayer perceptron network is used for modeling the functions $g_k$, $k = 1, \ldots, n$.
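To make the two-stage adaptation concrete, the following toy sketch (entirely our own: a hypothetical cubic parameterization $g_k(v) = a_k v + c_k v^3$ of the compensating nonlinearities, and a fixed cubic score function in place of the adaptive score estimators of [418]) runs a few joint gradient steps of both stages. It illustrates the form of the updates only; it is not a tuned implementation of the method:

```python
# Toy two-stage post-nonlinear separation: cubic g_k, natural-gradient B.
import numpy as np

rng = np.random.default_rng(3)
n, T, mu = 2, 5000, 0.01
s = rng.uniform(-1, 1, (n, T))                       # subgaussian sources
x = np.tanh(np.array([[1.0, 0.6], [0.5, 1.0]]) @ s)  # post-nonlinear mixtures

B = np.eye(n)                          # separating matrix (linear stage)
theta = np.tile([1.0, 0.1], (n, 1))    # per-channel parameters (a_k, c_k)
psi = lambda y: y**3                   # fixed cubic score (subgaussian case)

for step in range(200):
    a, c = theta[:, :1], theta[:, 1:]
    v = x                              # input of the nonlinear stage
    z = a * v + c * v**3               # g_k(v_k; theta_k), monotone for a, c > 0
    y = B @ z
    # linear stage: natural gradient version of (17.8)
    B += mu * (np.eye(n) - (psi(y) @ y.T) / T) @ B
    # nonlinear stage: sample gradient of the mutual information w.r.t. theta_k
    gp = a + 3 * c * v**2              # g'_k(v_k; theta_k)
    for k in range(n):
        for j, dg, dgp in ((0, v[k], np.ones(T)), (1, v[k]**3, 3 * v[k]**2)):
            grad = -np.mean(dgp / gp[k]) - np.mean((B[:, k] @ psi(y)) * dg)
            theta[k, j] -= mu * grad
```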
In linear BSS, it suffices that the score functions (17.9) are of the right type for
achieving separation. However, their appropriate estimation is critical for the good
performance of the proposed nonlinear separation method. The score functions (17.9)
must be estimated adaptively from the output vector $\mathbf{y}$. Several alternative ways to
do this are considered in [418]. An estimation method based on the Gram-Charlier
expansion performs appropriately only for mild post-nonlinear distortions. However,
another method, which estimates the score functions directly, also provides very good
results for hard nonlinearities. Experimental results are presented in [418]. A well-performing batch-type method for estimating the score functions has been introduced in a later paper [417].
Before proceeding, we mention that separation of post-nonlinear mixtures has also been studied in [271, 267, 469] using mainly extensions of the natural gradient
algorithm.
17.3 NONLINEAR BSS USING SELF-ORGANIZING MAPS
One of the earliest ideas for achieving nonlinear BSS (or ICA) is to use Kohonen’s
self-organizing map (SOM) to that end. This method was originally introduced by
Pajunen et al. [345]. The SOM [247, 172] is a well-known mapping and visualization
method that in an unsupervised manner learns a nonlinear mapping from the data to
a usually two-dimensional grid. The learned mapping from often high-dimensional
data space to the grid is such that it tries to preserve the structure of the data as well
as possible. Another goal in the SOM method is to map the data so that they are approximately uniformly distributed on the rectangular (or hexagonal) grid. This can be roughly achieved with suitable choices [345].
If the joint probability density of two random variables is uniform inside a rectangle, then clearly the two variables are statistically independent, with uniform marginal densities along the sides of the rectangle. This observation gives the justification for applying the self-organizing map to nonlinear BSS or ICA. The SOM mapping provides the
regularization needed in nonlinear BSS, because it tries to preserve the structure
of the data. This implies that the mapping should be as simple as possible while
achieving the desired goals.
Fig. 17.1 Original source signals.
Fig. 17.2 Nonlinear mixtures.
The following experiment [345] illustrates the use of the self-organizing map in nonlinear blind source separation. There were two subgaussian source signals, shown in Fig. 17.1, consisting of a sinusoid and uniformly distributed white noise. Each source vector $\mathbf{s}(t)$ was first mixed linearly using a fixed $2 \times 2$ mixing matrix $\mathbf{A}$ (17.10). After this, the data vectors $\mathbf{x}(t)$ were obtained as post-nonlinear mixtures of the sources by applying the formula (17.4) with fixed componentwise nonlinearities $f_i$. These mixtures are depicted in Fig. 17.2.
Fig. 17.3 Signals separated by SOM.
Fig. 17.4 Converged SOM map.
The sources separated by the SOM method are shown in Fig. 17.3, and the
converged SOM map is illustrated in Fig. 17.4. The estimates of the source signals
in Fig. 17.3 are obtained by mapping each data vector onto the map of Fig. 17.4,
and reading the coordinates of the mapped data vector. Even though the preceding
experiment was carried out with post-nonlinear mixtures, the use of the SOM method
is not limited to them.
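The recipe of mapping data vectors onto the grid can be written compactly. The sketch below (our own minimal SOM; grid size, learning rate, and neighborhood schedule are arbitrary illustrative choices) trains a small map on toy post-nonlinear mixtures and reads off the winner coordinates as discrete source estimates:

```python
# Minimal 2-D SOM used for nonlinear BSS in the spirit of [345].
import numpy as np

rng = np.random.default_rng(4)

T = 2000
s = np.vstack([np.sin(0.1 * np.arange(T)), rng.uniform(-1, 1, T)])
x = np.tanh(np.array([[1.0, 0.8], [0.6, 1.0]]) @ s).T      # (T, 2) mixtures

G = 20                                   # G x G rectangular map
grid = np.stack(np.meshgrid(np.arange(G), np.arange(G)), -1).reshape(-1, 2)
w = rng.normal(scale=0.1, size=(G * G, 2))                 # codebook vectors

n_iter = 20000
for t in range(n_iter):
    lr = 0.5 * (1 - t / n_iter)                  # decaying learning rate
    sigma = 1 + (G / 2) * (1 - t / n_iter)       # shrinking neighborhood
    v = x[rng.integers(T)]
    win = np.argmin(((w - v) ** 2).sum(1))       # best-matching unit
    h = np.exp(-((grid - grid[win]) ** 2).sum(1) / (2 * sigma**2))
    w += lr * h[:, None] * (v - w)               # SOM update rule

# source estimates: winner grid coordinates of each data vector
winners = np.argmin(((x[:, None, :] - w[None]) ** 2).sum(-1), axis=1)
s_hat = grid[winners].T                          # two discrete "sources"
```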
Generally speaking, there are several difficulties in applying self-organizing maps
to nonlinear blind source separation. If the sources are uniformly distributed, then
it can be heuristically justified that the regularization of the nonlinear separating
mapping provided by the SOM approximately separates the sources. But if the true
sources are not uniformly distributed, the separating mapping providing uniform
densities inevitably causes distortions, which are in general the more serious the
farther the true source densities are from the uniform ones. Of course, the SOM
method still provides an approximate solution to the nonlinear ICA problem, but this
solution may have little to do with the true source signals.
Another difficulty in using SOM for nonlinear BSS or ICA is that computational
complexity increases very rapidly with the number of the sources (dimensionality of
the map), limiting the potential application of this method to small-scale problems.
Furthermore, the mapping provided by the SOM is discrete, with the discretization determined by the number of grid points.
17.4 A GENERATIVE TOPOGRAPHIC MAPPING APPROACH TO
NONLINEAR BSS *
17.4.1 Background
The self-organizing map discussed briefly in the previous section is a nonlinear
mapping method that is inspired by neurobiological modeling arguments. Bishop,
Svensén, and Williams introduced the generative topographic mapping (GTM) method
as a statistically more principled alternative to SOM. Their method is presented in
detail in [49].
In the basic GTM method, mutually similar impulse (delta) functions that are
equispaced on a rectangular grid are used to model the discrete uniform density in the
space of latent variables, or the joint density of the sources in our case. The mapping
from the sources to the observed data, corresponding in our nonlinear BSS problem
to the nonlinear mixing mapping (17.2), is modeled using a mixture-of-gaussians
model. The parameters of the mixture-of-gaussians model, defining the mixing
mapping, are then estimated using a maximum likelihood (ML) method (see Section
4.5) realized by the expectation-maximization (EM) algorithm [48, 172]. After this,
the inverse (separating) mapping from the data to the latent variables (sources) can
be determined.
It is well known that any sufficiently smooth continuous mapping can be approximated
with arbitrary accuracy using a mixture-of-gaussians model with sufficiently many
gaussian basis functions [172, 48]. Roughly stated, this provides the theoretical
basis of the GTM method. A fundamental difference of the GTM method compared
with SOM is that GTM is based on a generative approach that starts by assuming
a model for the latent variables, in our case the sources. On the other hand, SOM
tries to separate the sources directly by starting from the data and constructing a
suitable separating signal transformation. A key benefit of GTM is its firm theoretical
foundation which helps to overcome some of the limitations of SOM. This also
provides the basis of generalizing the GTM approach to arbitrary source densities.
Using the basic GTM method instead of SOM for nonlinear blind source separation
does not yet bring any notable improvement, because the densities of the sources
are still assumed to be uniform. However, it is straightforward to generalize the GTM
method to arbitrary known source densities. The advantage of this approach is that
one can directly regularize the inverse of the mixing mapping by using the known
source densities. This modified GTM method is then used for finding a noncomplex
mixing mapping. This approach is described in the following.
17.4.2 The modified GTM method
The modified GTM method introduced in [346] differs from the standard GTM [49]
only in that the required joint density of the latent variables (sources) is defined as
a weighted sum of delta functions instead of plain delta functions. The weighting
coefficients are determined by discretizing the known source densities. Only the main
points of the GTM method are presented here, with emphasis on the modifications
made for applying it to nonlinear blind source separation. Readers wishing to gain a
deeper understanding of the GTM method should look at the original paper [49].
The GTM method closely resembles SOM in that it uses a discrete grid of points forming a regular array in the $m$-dimensional latent space. As in SOM, the dimension of the latent space is usually two. Vectors lying in the latent space are denoted by $\mathbf{s}$; in our application they will be source vectors. The GTM method uses a set of $M$ fixed nonlinear basis functions $\phi_j(\mathbf{s})$, $j = 1, \ldots, M$, which form a nonorthogonal basis set. These basis functions typically consist of a regular array of spherical gaussian functions, but the basis functions can at least in principle be of other types.

The mapping from the $m$-dimensional latent space to the $n$-dimensional data space, which is in our case the mixing mapping $\mathbf{f}$ of Eq. (17.2), is in GTM modeled as a linear combination of the basis functions $\phi_j(\mathbf{s})$:

$\mathbf{f}(\mathbf{s}; \mathbf{W}) = \mathbf{W}\boldsymbol{\phi}(\mathbf{s})$   (17.11)

Here $\mathbf{W}$ is an $n \times M$ matrix of weight parameters and $\boldsymbol{\phi}(\mathbf{s}) = [\phi_1(\mathbf{s}), \ldots, \phi_M(\mathbf{s})]^T$.
Denote the node locations in the latent space by $\mathbf{s}_i$, $i = 1, \ldots, K$. Eq. (17.11) then defines a corresponding set of reference vectors

$\mathbf{m}_i = \mathbf{W}\boldsymbol{\phi}(\mathbf{s}_i)$   (17.12)

in data space. Each of these reference vectors then forms the center of an isotropic gaussian distribution in data space. Denoting the common variance of these gaussians by $1/\beta$, we get

$p(\mathbf{x} \mid \mathbf{s}_i, \mathbf{W}, \beta) = \left(\dfrac{\beta}{2\pi}\right)^{n/2} \exp\left(-\dfrac{\beta}{2} \|\mathbf{x} - \mathbf{m}_i\|^2\right)$   (17.13)
The probability density function for the GTM model is obtained by summing over all of the gaussian components, yielding

$p(\mathbf{x} \mid \mathbf{W}, \beta) = \dfrac{1}{K} \sum_{i=1}^{K} p(\mathbf{x} \mid \mathbf{s}_i, \mathbf{W}, \beta)$   (17.14)

Here $K$ is the total number of gaussian components, which is equal to the number of grid points in the latent space, and the prior probabilities of the gaussian components are all equal to $1/K$.
GTM tries to represent the distribution of the observed data in the $n$-dimensional data space in terms of a smaller $m$-dimensional nonlinear manifold [49]. The gaussian distribution in (17.13) represents a noise or error model, which is needed because the data usually do not lie exactly in such a lower dimensional manifold. It is important to realize that the gaussian distributions defined in (17.13) have nothing to do with the basis functions $\phi_j(\mathbf{s})$, $j = 1, \ldots, M$. Usually it is advisable that the number $M$ of the basis functions is clearly smaller than the number $K$ of node locations and their respective noise distributions (17.13). In this way, one can avoid overfitting and prevent the mixing mapping (17.11) from becoming overly complicated.
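As a small illustration of this point, the following sketch (our own; the grid sizes and basis width are arbitrary) builds the $K \times M$ matrix of spherical gaussian basis function values, with $M$ clearly smaller than $K$, in the form used by the mapping (17.11):

```python
# Fixed gaussian basis matrix for GTM: Phi[i, j] = phi_j(s_i).
import numpy as np

def basis_matrix(nodes, centers, width):
    """Spherical gaussian bases: exp(-||s_i - mu_j||^2 / (2 width^2))."""
    d2 = ((nodes[:, None, :] - centers[None]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width**2))

g = np.linspace(-1, 1, 20)                  # K = 400 node locations
nodes = np.stack(np.meshgrid(g, g), -1).reshape(-1, 2)
c = np.linspace(-1, 1, 5)                   # M = 25 basis centers, M << K
centers = np.stack(np.meshgrid(c, c), -1).reshape(-1, 2)
Phi = basis_matrix(nodes, centers, width=0.5)    # (K, M)
```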
The unknown parameters in this model are the weight matrix $\mathbf{W}$ and the inverse variance $\beta$. These parameters are estimated by fitting the model (17.14) to the observed data vectors $\mathbf{x}(1), \ldots, \mathbf{x}(T)$ using the maximum likelihood method discussed earlier in Section 4.5. The log likelihood function of the observed data is given by

$\mathcal{L}(\mathbf{W}, \beta) = \sum_{t=1}^{T} \ln p(\mathbf{x}(t) \mid \mathbf{W}, \beta) = \sum_{t=1}^{T} \ln \left[\dfrac{1}{K} \sum_{i=1}^{K} p(\mathbf{x}(t) \mid \mathbf{s}_i, \mathbf{W}, \beta)\right]$   (17.15)

where $\beta^{-1}$ is the variance of $\mathbf{x}$ given $\mathbf{s}_i$ and $\mathbf{W}$, and $T$ is the total number of data vectors $\mathbf{x}(t)$.
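The density model and its likelihood are compact in code. The functions below (our own notation; the node prior is kept as an explicit argument so that the modification of Section 17.4.2 drops in directly) evaluate (17.13)-(17.15) and the EM responsibilities used when fitting $\mathbf{W}$ and $\beta$:

```python
# GTM density, log likelihood, and E-step responsibilities.
import numpy as np

def gtm_log_likelihood(X, Phi, W, beta, prior):
    """Log likelihood (17.15) of the data under the mixture (17.14).
    X: (T, n) data; Phi: (K, M) basis matrix with row i = phi(s_i)^T;
    W: (n, M) weights; prior: (K,) node probabilities -- uniform 1/K in
    standard GTM, nonuniform in the modified method of Section 17.4.2."""
    n = X.shape[1]
    Mref = Phi @ W.T                                    # (K, n) reference vectors, Eq. (17.12)
    d2 = ((X[:, None, :] - Mref[None]) ** 2).sum(-1)    # (T, K) squared distances
    log_c = 0.5 * n * np.log(beta / (2 * np.pi)) - 0.5 * beta * d2  # Eq. (17.13)
    m = log_c.max(1, keepdims=True)                     # log-sum-exp stabilization
    return (m[:, 0] + np.log((prior * np.exp(log_c - m)).sum(1))).sum()

def responsibilities(X, Phi, W, beta, prior):
    """E-step: posterior probability of node i given data vector x(t);
    the common factor (beta/2pi)^(n/2) cancels in the normalization."""
    Mref = Phi @ W.T
    d2 = ((X[:, None, :] - Mref[None]) ** 2).sum(-1)
    d2 -= d2.min(1, keepdims=True)                      # stabilize exp()
    p = prior * np.exp(-beta / 2 * d2)
    return p / p.sum(1, keepdims=True)                  # (T, K)
```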
For applying the modified GTM method, the probability density function $p(\mathbf{s})$ of the source vectors $\mathbf{s}$ should be known. Assuming that the sources are statistically independent, this joint density can be evaluated as the product of the marginal densities of the individual sources:

$p(\mathbf{s}) = \prod_{i=1}^{m} p_i(s_i)$   (17.16)

Each marginal density $p_i(s_i)$ is here a discrete density defined at the sampling points corresponding to the locations of the node vectors.
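This discretization is simple to express in code. The sketch below (our own; the example marginals are arbitrary) evaluates the known source marginals at the node locations, multiplies them according to (17.16), and normalizes the result into the discrete node prior used above:

```python
# Discrete node prior for the modified GTM from known source marginals.
import numpy as np

def node_prior(node_locations, marginal_pdfs):
    """node_locations: (K, m) latent grid; marginal_pdfs: list of m callables."""
    w = np.ones(len(node_locations))
    for i, pdf in enumerate(marginal_pdfs):
        w *= pdf(node_locations[:, i])       # product of marginals, Eq. (17.16)
    return w / w.sum()                       # discrete density over the grid

# example: one gaussian-like and one uniform source on a 20 x 20 grid
g = np.linspace(-1, 1, 20)
nodes = np.stack(np.meshgrid(g, g), -1).reshape(-1, 2)
prior = node_prior(nodes, [lambda u: np.exp(-u**2 / 0.08),
                           lambda u: np.ones_like(u)])
```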
The latent space in the GTM method usually has a small dimension, typically $m = 2$. The method can be applied in principle for larger $m$, but its computational load then increases quite rapidly, just as in the SOM method. For this reason, only
[...] nonlinear factor analysis. It is obvious that the data are quite nonlinear, because nonlinear factor analysis is able to explain the data with 10 components equally well as linear factor analysis (PCA) with 21 components. Different numbers of hidden neurons and sources were tested with random initializations using a gaussian model for the sources (nonlinear factor analysis). It turned out that the Kullback-Leibler [...]

[...] [259, 438]. Nonlinear independent component analysis or blind source separation are generally difficult problems both computationally and conceptually. Therefore, local linear ICA/BSS methods have received some attention recently as a practical compromise between linear ICA and completely nonlinear ICA or BSS. These methods are more general than standard linear ICA in that several different linear ICA models [...] describe the observed data. The local linear ICA models can be either overlapping, as in the mixture-of-ICA methods introduced in [273], or nonoverlapping, as in the clustering-based methods proposed in [234, 349].

17.7 CONCLUDING REMARKS

In this chapter, generalizations of standard linear independent component analysis (ICA) or blind source separation (BSS) problems to nonlinear data models have been considered [...] mapping which yields independent components [345]. Lin, Grier, and Cowan [279] have independently proposed using SOM for nonlinear ICA in a different manner by treating ICA as a computational geometry problem. The ensemble learning approach to nonlinear ICA, discussed in more detail earlier in this chapter, is based on using multilayer perceptron networks as a flexible model for the nonlinear mixing mapping [...]

[...] FastICA algorithm alone. Linear ICA is able to retrieve the original sources with only 0.7 dB signal-to-noise ratio (SNR). In practice a linear method could not deduce the number of sources, and the result would be even worse. The poor signal-to-noise ratio shows that the data really lies in a nonlinear subspace. Figure 17.9 depicts the results after 2000 sweeps with gaussian sources (nonlinear factor analysis) [...]

[...] method (see Chapter 9) for nonlinear BSS, presenting also an extension based on entropy maximization and experiments with post-nonlinear mixtures. Xu has developed the general Bayesian Ying-Yang framework, which can be applied to ICA as well; see, e.g., [462, 463]. Other general approaches proposed for solving the nonlinear ICA or BSS problems include a pattern repulsion based method [295], a state-space modeling [...]

Fig. 17.8 Original sources are on the x-axis of each scatter plot and the sources estimated by a linear ICA are on the y-axis. Signal-to-noise ratio is 0.7 dB.

17.5.5 Experimental results

In all the simulations, the total number of sweeps was 7500, where one sweep means going through all the observations x(t) once. As explained before, a nonlinear factor analysis (or nonlinear PCA subspace) representation [...] generated the observed data. This regularization principle has a firm theoretical foundation, and it is intuitively satisfying for the nonlinear source separation problem. The results are encouraging for both artificial and real-world data. The ensemble-learning method allows nonlinear source separation for larger scale problems than some previously proposed computationally quite demanding [...]
[...] Fig. 17.11 The remaining energy of the process data as a function of the number of extracted components using linear and nonlinear factor analysis. [...] which was used for generating the data appears on the x-axis, and the respective estimated source is on the y-axis of [...] other methods proposed for nonlinear independent component analysis or blind source separation are briefly reviewed. The interested reader can find more information on them in the given references. Already in 1987, Jutten [226] used soft nonlinear mixtures for assessing the robustness and performance of the seminal Hérault-Jutten algorithm introduced for the linear BSS problem (see Chapter 12). However, Burel [...]