12

ICA by Nonlinear Decorrelation and Nonlinear PCA
This chapter starts by reviewing some of the early research efforts in independent
component analysis (ICA), especially the technique based on nonlinear decorrelation
that was successfully used by Jutten, Hérault, and Ans to solve the first ICA problems.
Today, this work is mainly of historical interest, because several more
efficient algorithms for ICA now exist.
Nonlinear decorrelation can be seen as an extension of second-order methods
such as whitening and principal component analysis (PCA). These methods give
components that are uncorrelated linear combinations of input variables, as explained
in Chapter 6. We will show that independent components can in some cases be found
as nonlinearly uncorrelated linear combinations. The nonlinear functions used in
this approach introduce higher order statistics into the solution method, making ICA
possible.
We then show how the work on nonlinear decorrelation eventually led to the
Cichocki-Unbehauen algorithm, which is essentially the same as the algorithm that
we derived in Chapter 9 using the natural gradient. Next, the criterion of nonlinear
decorrelation is extended and formalized into the theory of estimating functions, and
the closely related EASI algorithm is reviewed.
Another approach to ICA that is related to PCA is the so-called nonlinear PCA.
A nonlinear representation is sought for the input data that minimizes a least mean-
square error criterion. For the linear case, it was shown in Chapter 6 that principal
components are obtained. It turns out that in some cases the nonlinear PCA approach
gives independent components instead. We review the nonlinear PCA criterion and
show its equivalence to other criteria like maximum likelihood (ML). Then, two
typical learning rules introduced by the authors are reviewed, of which the first one
is a stochastic gradient algorithm and the other one a recursive least mean-square
algorithm.

Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja.
Copyright 2001 John Wiley & Sons, Inc.
ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)
12.1 NONLINEAR CORRELATIONS AND INDEPENDENCE

The correlation between two random variables $x$ and $y$ was discussed in detail in
Chapter 2. Here we consider zero-mean variables only, so correlation and covariance
are equal. Correlation is related to independence in such a way that independent
variables are always uncorrelated. The opposite is not true, however: the variables
can be uncorrelated, yet dependent. An example is a uniform density in a rotated
square centered at the origin of the $(x, y)$ space; see, e.g., Fig. 8.3. Both $x$ and $y$
are zero mean and uncorrelated, no matter what the orientation of the square, but
they are independent only if the square is aligned with the coordinate axes. In some
cases uncorrelatedness does imply independence, though; the best example is the
case when the density of $(x, y)$ is constrained to be jointly gaussian.
Extending the concept of correlation, we here define the nonlinear correlation of
the random variables $x$ and $y$ as $E\{f(x)g(y)\}$. Here, $f$ and $g$ are two
functions, of which at least one is nonlinear. Typical examples might be polynomials
of degree higher than 1, or more complex functions like the hyperbolic tangent. This
means that one or both of the random variables are first transformed nonlinearly to
new variables $f(x), g(y)$, and then the usual linear correlation between these new
variables is considered.

The question now is: Assuming that $x$ and $y$ are nonlinearly decorrelated in the
sense

$$E\{f(x)g(y)\} = 0 \qquad (12.1)$$

can we say something about their independence? We would hope that by making
this kind of nonlinear correlation zero, independence would be obtained under some
additional conditions to be specified.

There is a general theorem (see, e.g., [129]) stating that $x$ and $y$ are independent
if and only if

$$E\{f(x)g(y)\} = E\{f(x)\}E\{g(y)\} \qquad (12.2)$$

for all continuous functions $f$ and $g$ that are zero outside a finite interval. Based
on this, it seems very difficult to approach independence rigorously, because the
functions $f$ and $g$ are almost arbitrary. Some kind of approximations are needed.
This problem was considered by Jutten and Hérault [228]. Let us assume that $f$
and $g$ are smooth functions that have derivatives of all orders in a neighborhood
of the origin. They can be expanded in Taylor series:

$$f(x) = \sum_i f_i x^i, \qquad g(y) = \sum_j g_j y^j$$

where $f_i, g_j$ are shorthand for the coefficients of the $i$th and $j$th powers in the series.
The product of the functions is then

$$f(x)g(y) = \sum_i \sum_j f_i g_j x^i y^j \qquad (12.3)$$

and condition (12.1) is equivalent to

$$E\{f(x)g(y)\} = \sum_i \sum_j f_i g_j E\{x^i y^j\} = 0 \qquad (12.4)$$

Obviously, a sufficient condition for this equation to hold is

$$E\{x^i y^j\} = 0 \qquad (12.5)$$

for all indices $i, j$ appearing in the series expansion (12.4). There may be other
solutions in which the higher-order correlations are not zero, but the coefficients
happen to be just suitable to cancel the terms and make the sum in (12.4) exactly
equal to zero. For nonpolynomial functions that have infinite Taylor expansions, such
spurious solutions can be considered unlikely (we will see later that such spurious
solutions do exist but they can be avoided by the theory of ML estimation).

Again, a sufficient condition for (12.5) to hold is that the variables $x$ and $y$ are
independent and one of $E\{x^i\}$, $E\{y^j\}$ is zero. Let us require that $E\{x^i\} = 0$ for all
powers $x^i$ appearing in the series expansion of $f$. But this is only possible if $f$ is an odd
function; then the Taylor series contains only odd powers $x^i$, and the powers $i$
in Eq. (12.5) will also be odd. Otherwise, we would have the case that even moments of
$x$, like the variance, are zero, which is impossible unless $x$ is constant.

To conclude, a sufficient (but not necessary) condition for the nonlinear uncorrelatedness
(12.1) to hold is that $x$ and $y$ are independent, and for one of them, say
$x$, the nonlinearity $f$ is an odd function such that $f(x)$ has zero mean.
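This conclusion can be checked numerically. The following sketch (not from the original text; the rotation angle, sample size, and the choice $f(x) = x^3$, $g(y) = y$ are arbitrary illustrative choices) builds a pair of uncorrelated but dependent variables from a rotated uniform density, and shows that the nonlinear correlation exposes the dependence while the linear correlation does not:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Two independent, zero-mean uniform variables.
u = rng.uniform(-1, 1, n)
v = rng.uniform(-1, 1, n)

# Rotate by 30 degrees: x and y stay uncorrelated (a rotation of two
# equal-variance uncorrelated variables is still uncorrelated), but they
# are now dependent, as in the rotated-square example.
t = np.pi / 6
x = np.cos(t) * u + np.sin(t) * v
y = -np.sin(t) * u + np.cos(t) * v

linear_corr = np.mean(x * y)        # ordinary correlation: ~0
nonlinear_corr = np.mean(x**3 * y)  # nonlinear correlation, f(x) = x^3, g(y) = y
indep_corr = np.mean(u**3 * v)      # same statistic for truly independent variables

print(linear_corr, nonlinear_corr, indep_corr)
```

For this particular construction the expectation can even be computed in closed form, $E\{x^3 y\} = \sin(4\theta)/30 \approx 0.029$ at $\theta = 30°$: the sample estimate should be near this value, while both the linear correlation and the statistic for the independent pair are near zero.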
The preceding discussion is informal, but it should make it credible that nonlinear
correlations are useful as a possible general criterion for independence. Several things
have to be decided in practice. The first is how to actually choose the functions
$f$ and $g$. Is there some natural optimality criterion that can tell us that some functions
Fig. 12.1 The basic feedback circuit for the Hérault-Jutten algorithm: the mixtures
$x_1, x_2$ enter two summation elements, whose outputs $y_1, y_2$ are fed back through
the cross-weights $-m_{12}$ and $-m_{21}$. The element marked with $\Sigma$ is a summation.
are better than some other ones? This will be answered in Sections 12.3 and 12.4.
The second problem is how we could solve Eq. (12.1), or nonlinearly decorrelate two
variables $y_1, y_2$. This is the topic of the next section.
12.2 THE HÉRAULT-JUTTEN ALGORITHM
Consider the ICA model $\mathbf{x} = \mathbf{A}\mathbf{s}$. Let us first look at the two-dimensional case, which was
considered by Hérault, Jutten, and Ans [178, 179, 226] in connection with the blind
separation of two signals from two linear mixtures. The model is then

$$x_1 = a_{11}s_1 + a_{12}s_2$$
$$x_2 = a_{21}s_1 + a_{22}s_2$$

Hérault and Jutten proposed the feedback circuit shown in Fig. 12.1 to solve the problem.
The initial outputs are fed back to the system, and the outputs are recomputed
until an equilibrium is reached.

From Fig. 12.1 we have directly

$$y_1 = x_1 - m_{12}y_2 \qquad (12.6)$$
$$y_2 = x_2 - m_{21}y_1 \qquad (12.7)$$

Before inputting the mixture signals $x_1, x_2$ to the network, they were normalized to
zero mean, which means that the outputs $y_1, y_2$ also will have zero means. Defining a
matrix $\mathbf{M}$ with off-diagonal elements $m_{12}, m_{21}$ and diagonal elements equal to zero,
these equations can be compactly written as

$$\mathbf{y} = \mathbf{x} - \mathbf{M}\mathbf{y}$$

Thus the input-output mapping of the network is

$$\mathbf{y} = (\mathbf{I} + \mathbf{M})^{-1}\mathbf{x} \qquad (12.8)$$
Note that from the original ICA model we have $\mathbf{s} = \mathbf{A}^{-1}\mathbf{x}$, provided that $\mathbf{A}$ is
invertible. If $\mathbf{I} + \mathbf{M} = \mathbf{A}$, then $\mathbf{y}$ becomes equal to $\mathbf{s}$. However, the problem in blind
separation is that the matrix $\mathbf{A}$ is unknown.

The solution that Jutten and Hérault introduced was to adapt the two feedback
coefficients $m_{12}, m_{21}$ so that the outputs $y_1, y_2$ of the network become independent.
Then the matrix $\mathbf{A}$ has been implicitly inverted and the original sources have been
found. For independence, they used the criterion of nonlinear correlations. They
proposed the following learning rules:

$$\Delta m_{12} = \mu f(y_1)g(y_2) \qquad (12.9)$$
$$\Delta m_{21} = \mu f(y_2)g(y_1) \qquad (12.10)$$

with $\mu$ the learning rate. Both functions are odd functions; typically, the
functions $f(y) = y^3$, $g(y) = \arctan(y)$ were used, although the method also seems
to work for $g(y) = y$ or $g(y) = \mathrm{sign}(y)$.

Now, if the learning converges, then the right-hand sides must be zero on average,
implying

$$E\{f(y_1)g(y_2)\} = E\{f(y_2)g(y_1)\} = 0$$

Thus independence has hopefully been attained for the outputs $y_1, y_2$. A stability
analysis for the Hérault-Jutten algorithm was presented in [408].

In the numerical computation of the matrix $\mathbf{M}$ according to algorithm (12.9), (12.10),
the outputs $y_1, y_2$ on the right-hand side must also be updated at each step of the
iteration. By Eq. (12.8), they too depend on $\mathbf{M}$, and solving them requires the
inversion of the matrix $\mathbf{I} + \mathbf{M}$. As noted by Cichocki and Unbehauen [84], this matrix
inversion may be computationally heavy, especially if this approach is extended to
more than two sources and mixtures. One way to circumvent this problem is to make
a rough approximation of the inverse (e.g., the first-order expansion
$(\mathbf{I} + \mathbf{M})^{-1} \approx \mathbf{I} - \mathbf{M}$) that seems to work in practice.
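To make the algorithm concrete, here is a minimal simulation sketch (not from the book; the mixing matrix, learning-rate schedule, and sample sizes are arbitrary illustrative choices). It uses the classic nonlinearities $f(y) = y^3$, $g(y) = \arctan(y)$ and, for the two-source case, solves the feedback equilibrium exactly with the $2 \times 2$ inverse of Eq. (12.8):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Two independent zero-mean uniform sources, and a mixing matrix with
# unit diagonal, so that at the separating solution m12 = 0.3, m21 = 0.4.
s = rng.uniform(-1, 1, (2, n))
A = np.array([[1.0, 0.3],
              [0.4, 1.0]])
x = A @ s

m12, m21 = 0.0, 0.0
for epoch in range(3):
    mu = 0.05 / (epoch + 1)              # shrinking learning rate
    for t in range(n):
        # Equilibrium of the feedback circuit: y = (I + M)^{-1} x.
        d = 1.0 - m12 * m21
        y1 = (x[0, t] - m12 * x[1, t]) / d
        y2 = (x[1, t] - m21 * x[0, t]) / d
        # Learning rules with f(y) = y^3, g(y) = arctan(y).
        m12 += mu * y1**3 * np.arctan(y2)
        m21 += mu * y2**3 * np.arctan(y1)

print(m12, m21)   # should drift toward the off-diagonal mixing coefficients
```

When $\mathbf{I} + \mathbf{M}$ approaches $\mathbf{A}$, the network outputs approach the sources; a quick check is that the recovered outputs correlate almost perfectly with $s_1$ and $s_2$.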
Although the Hérault-Jutten algorithm was a very elegant pioneering solution to
the ICA problem, we know now that it has some drawbacks in practice. The algorithm
may work poorly or even fail to separate the sources altogether if the signals are badly
scaled or the mixing matrix is ill-conditioned. The number of sources that the method
can separate is severely limited. Also, although local stability was shown in [408],
good global convergence behavior is not guaranteed.
12.3 THE CICHOCKI-UNBEHAUEN ALGORITHM

Starting from the Hérault-Jutten algorithm, Cichocki, Unbehauen, and coworkers [82,
85, 84] derived an extension that has much enhanced performance and reliability.
Instead of a feedback circuit like the Hérault-Jutten network in Fig. 12.1, Cichocki
and Unbehauen proposed a feedforward network with weight matrix $\mathbf{W}$, with the
mixture vector $\mathbf{x}$ for input and with output $\mathbf{y} = \mathbf{W}\mathbf{x}$. Now the dimensionality of the
problem can be higher than 2. The goal is to adapt the matrix $\mathbf{W}$ so that the
elements of $\mathbf{y}$ become independent. The learning algorithm for $\mathbf{W}$ is as follows:

$$\Delta\mathbf{W} = \mu[\mathbf{\Lambda} - f(\mathbf{y})g(\mathbf{y})^T]\mathbf{W} \qquad (12.11)$$

where $\mu$ is the learning rate, $\mathbf{\Lambda}$ is a diagonal matrix whose elements determine the
amplitude scaling for the elements of $\mathbf{y}$ (typically, $\mathbf{\Lambda}$ could be chosen as the unit
matrix $\mathbf{I}$), and $f$ and $g$ are two nonlinear scalar functions; the authors proposed a
polynomial and a hyperbolic tangent. The notation $f(\mathbf{y})$ means a column vector with
elements $f(y_i)$.

The argumentation showing that this algorithm will give independent components,
too, is based on nonlinear decorrelations. Consider the stationary solution of this
learning rule, defined as the matrix $\mathbf{W}$ for which $E\{\Delta\mathbf{W}\} = 0$, with the expectation
taken over the density of the mixtures $\mathbf{x}$. For this matrix, the update is on average
zero. Because this is a stochastic approximation type algorithm (see Chapter 3), such
stationarity is a necessary condition for convergence. Excluding the trivial solution
$\mathbf{W} = 0$, we must have

$$E\{\mathbf{\Lambda} - f(\mathbf{y})g(\mathbf{y})^T\} = 0$$

Especially, for the off-diagonal elements, this implies

$$E\{f(y_i)g(y_j)\} = 0, \quad i \neq j \qquad (12.12)$$

which is exactly our definition of nonlinear decorrelation in Eq. (12.1), extended to $n$
output signals $y_1, \ldots, y_n$. The diagonal elements satisfy

$$E\{f(y_i)g(y_i)\} = \lambda_{ii}$$

showing that the diagonal elements $\lambda_{ii}$ of matrix $\mathbf{\Lambda}$ only control the amplitude scaling
of the outputs.

The conclusion is that if the learning rule converges to a nonzero matrix $\mathbf{W}$, then
the outputs of the network must become nonlinearly decorrelated, and hopefully
independent. The convergence analysis has been performed in [84]; for general
principles of analyzing stochastic iteration algorithms like (12.11), see Chapter 3.

The justification for the Cichocki-Unbehauen algorithm (12.11) in the original
articles was based on nonlinear decorrelations, not on any rigorous cost function
that would be minimized by the algorithm. However, it is interesting to note that
this algorithm, first appearing in the early 1990s, is in fact the same as the popular
natural gradient algorithm introduced later by Amari, Cichocki, and Yang [12] as
an extension of the original Bell-Sejnowski algorithm [36]. All we have to do is
choose $\mathbf{\Lambda}$ as the unit matrix, the function $g$ as the linear function $g(y) = y$,
and the function $f$ as a sigmoid related to the true density of the sources. The
Amari-Cichocki-Yang algorithm and the Bell-Sejnowski algorithm were reviewed
in Chapter 9, where it was shown how the algorithms are derived from the rigorous
maximum likelihood criterion. The maximum likelihood approach also tells us what
kind of nonlinearities should be used, as discussed in Chapter 9.
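As an illustration, the following sketch implements rule (12.11) in the special case just mentioned: $\mathbf{\Lambda} = \mathbf{I}$, $g(y) = y$, and $f = \tanh$, a suitable sigmoidal nonlinearity for super-gaussian (here Laplacian) sources. The mixing matrix, step size, and iteration count are arbitrary illustrative choices, and the update is averaged over the whole batch:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Three independent super-gaussian (Laplacian) sources with unit variance.
s = rng.laplace(0.0, 1.0 / np.sqrt(2.0), (3, n))
A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.1, 0.3, 1.0]])
x = A @ s

# Rule (12.11) with Lambda = I, f = tanh, g = identity, batch-averaged.
W = np.eye(3)
mu = 0.05
for _ in range(1000):
    y = W @ x
    W += mu * (np.eye(3) - np.tanh(y) @ y.T / n) @ W

# If separation succeeded, W A is close to a scaled permutation matrix.
P = W @ A
print(np.round(P, 2))
```

Each row of $\mathbf{W}\mathbf{A}$ should be dominated by a single element, meaning that each output recovers one source up to scaling; the scaling itself is fixed by the diagonal condition $E\{f(y_i)g(y_i)\} = \lambda_{ii} = 1$.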
12.4 THE ESTIMATING FUNCTIONS APPROACH *

Consider the criterion of nonlinear decorrelations being zero, generalized to $n$
random variables $y_1, \ldots, y_n$, as shown in Eq. (12.12). Among the possible roots of
these equations are the source signals $s_1, \ldots, s_n$. When solving these in an algorithm
like the Hérault-Jutten algorithm or the Cichocki-Unbehauen algorithm, one in fact
solves for the separating matrix $\mathbf{W}$.
This notion was generalized and formalized by Amari and Cardoso [8] to the case
of estimating functions. Again, consider the basic ICA model $\mathbf{x} = \mathbf{A}\mathbf{s}$, and let $\mathbf{B}_*$
denote a true separating matrix (we use this special notation here to avoid any
confusion). An estimating function $\mathbf{F}(\mathbf{x}, \mathbf{B})$ is a matrix-valued function such that

$$E\{\mathbf{F}(\mathbf{x}, \mathbf{B}_*)\} = 0 \qquad (12.13)$$

This means that, taking the expectation with respect to the density of $\mathbf{x}$, the true
separating matrices are roots of the equation. Once these are solved from Eq. (12.13),
the independent components are directly obtained.

Example 12.1 Given a set of nonlinear functions $g_i(y)$, with $i = 1, \ldots, n$,
and defining a vector function $\mathbf{g}(\mathbf{y}) = (g_1(y_1), \ldots, g_n(y_n))^T$, a suitable estimating
function for ICA is

$$\mathbf{F}(\mathbf{x}, \mathbf{B}) = \mathbf{\Lambda} - \mathbf{g}(\mathbf{y})\mathbf{y}^T, \qquad \mathbf{y} = \mathbf{B}\mathbf{x} \qquad (12.14)$$

because obviously $E\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}$ becomes diagonal when $\mathbf{B}$ is a true separating matrix
and $y_1, \ldots, y_n$ are independent and zero mean. Then the off-diagonal elements
become $E\{g_i(y_i)y_j\} = E\{g_i(y_i)\}E\{y_j\} = 0$. The diagonal matrix $\mathbf{\Lambda}$ determines the
scales of the separated sources. Another estimating function is the right-hand side of
the learning rule (12.11), $\mathbf{F}(\mathbf{x}, \mathbf{W}) = [\mathbf{\Lambda} - f(\mathbf{y})g(\mathbf{y})^T]\mathbf{W}$.
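A quick numerical check of Example 12.1 (a sketch with an arbitrarily chosen mixing matrix and the arbitrary choice $g = \tanh$): at a true separating matrix $\mathbf{B}_* = \mathbf{A}^{-1}$, the sample version of $E\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}$ is essentially diagonal, while at a non-separating matrix it is not:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

s = rng.laplace(0.0, 1.0 / np.sqrt(2.0), (2, n))  # independent unit-variance sources
A = np.array([[1.0, 0.6],
              [0.2, 1.0]])
x = A @ s

def sample_gyyT(B):
    """Sample average of g(y) y^T with g = tanh and y = B x."""
    y = B @ x
    return np.tanh(y) @ y.T / n

at_separating = sample_gyyT(np.linalg.inv(A))  # off-diagonals ~ E{g(s_i)} E{s_j} = 0
at_identity = sample_gyyT(np.eye(2))           # y = x is dependent: clearly nonzero

print(np.round(at_separating, 3))
print(np.round(at_identity, 3))
```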
There is a fundamental difference in the estimating function approach compared to
most of the other approaches to ICA: the usual starting point in ICA is a cost function
that somehow measures how independent or nongaussian the outputs are, and the
independent components are solved by minimizing the cost function. In contrast,
there is no such cost function here. The estimating function need not be the gradient
of any other function. In this sense, the theory of estimating functions is very general
and potentially useful for finding ICA algorithms. For a discussion of this approach
in connection with neural networks, see [328].

It is not a trivial question how to design in practice an estimating function so that
we can solve the ICA model. Even if we have two estimating functions that both
have been shaped in such a way that separating matrices are their roots, what is a
relevant measure to compare them? Statistical considerations are helpful here. Note
that in practice, the densities of the sources $\mathbf{s}$ and the mixtures $\mathbf{x}$ are unknown in
the ICA model. It is impossible in practice to solve Eq. (12.13) as such, because the
expectation cannot be formed. Instead, it has to be estimated using a finite sample of
$\mathbf{x}$. Denoting this sample by $\mathbf{x}(1), \ldots, \mathbf{x}(T)$, we use the sample function

$$\hat{E}\{\mathbf{F}(\mathbf{x}, \mathbf{B})\} = \frac{1}{T}\sum_{t=1}^{T}\mathbf{F}(\mathbf{x}(t), \mathbf{B})$$

Its root $\hat{\mathbf{B}}$ is then an estimator for the true separating matrix. Obviously (see Chapter
4), the root is a function of the training sample, and it is
meaningful to consider its statistical properties like bias and variance. This gives a
measure of goodness for the comparison of different estimating functions. The best
estimating function is one that gives the smallest error between the true separating
matrix $\mathbf{B}_*$ and the estimate $\hat{\mathbf{B}}$.

A particularly relevant measure is (Fisher) efficiency or asymptotic variance, as
the size $T$ of the sample grows large (see Chapter 4). The goal is
to design an estimating function that gives the smallest variance, given the set of
observations $\mathbf{x}(1), \ldots, \mathbf{x}(T)$. Then the optimal amount of information is extracted from the
training set.
The general result provided by Amari and Cardoso [8] is that estimating functions
of the form (12.14) are optimal in the sense that, given any estimating function $\mathbf{F}$,
one can always find a better or at least equally good estimating function (in the sense
of efficiency) having the form

$$\mathbf{F}^*(\mathbf{x}, \mathbf{B}) = \mathbf{\Lambda} - \mathbf{g}(\mathbf{y})\mathbf{y}^T \qquad (12.15)$$
$$\mathbf{y} = \mathbf{B}\mathbf{x} \qquad (12.16)$$

where $\mathbf{\Lambda}$ is a diagonal matrix. Actually, the diagonal matrix has no effect on the
off-diagonal elements of $\mathbf{g}(\mathbf{y})\mathbf{y}^T$, which are the ones determining the independence
between $y_1, \ldots, y_n$; the diagonal elements are simply scaling factors.

The result shows that it is unnecessary to use a nonlinear function instead of $\mathbf{y}$
as the other one of the two functions in nonlinear decorrelation. Only one nonlinear
function $\mathbf{g}$, combined with the linear function $\mathbf{y}$, is sufficient. It is interesting that functions of exactly
the type $\mathbf{g}(\mathbf{y})\mathbf{y}^T$ naturally emerge as gradients of cost functions such as the likelihood;
the question of how to choose the nonlinearity $\mathbf{g}$ is also answered in that case. A
further example is given in the following section.
The preceding analysis is not related in any way to the practical methods for finding
the roots of estimating functions. Due to the nonlinearities, closed-form solutions do
not exist and numerical algorithms have to be used. The simplest iterative stochastic
approximation algorithm for solving the roots of $\mathbf{F}(\mathbf{x}, \mathbf{B})$ has the form

$$\Delta\mathbf{B} = \mu\,\mathbf{F}(\mathbf{x}, \mathbf{B}) \qquad (12.17)$$

with $\mu$ an appropriate learning rate. In fact, we now discover that the learning rules
(12.9), (12.10), and (12.11) are examples of this more general framework.
12.5 EQUIVARIANT ADAPTIVE SEPARATION VIA INDEPENDENCE

In most of the proposed approaches to ICA, the learning rules are gradient descent
algorithms for cost (or contrast) functions. Many cases have been covered in previous
chapters. Typically, the cost function has the form $\mathcal{J}(\mathbf{W}) = \sum_i E\{G(y_i)\}$, with $G$
some scalar function, and usually some additional constraints are used. Here again
$\mathbf{y} = \mathbf{W}\mathbf{x}$, and the form of the function $G$ and the probability density of $\mathbf{x}$ determine
the shape of the contrast function $\mathcal{J}$.

It is easy to show (see the definition of matrix and vector gradients in Chapter 3)
that

$$\frac{\partial \mathcal{J}}{\partial \mathbf{W}} = E\{\mathbf{g}(\mathbf{W}\mathbf{x})\mathbf{x}^T\} = E\{\mathbf{g}(\mathbf{y})\mathbf{x}^T\} \qquad (12.18)$$

where $\mathbf{g}$ is the gradient of $G$. If $\mathbf{W}$ is square and invertible, then $\mathbf{x}^T = \mathbf{y}^T(\mathbf{W}^{-1})^T$
and we have

$$\frac{\partial \mathcal{J}}{\partial \mathbf{W}} = E\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}(\mathbf{W}^{-1})^T \qquad (12.19)$$

For appropriate nonlinearities $\mathbf{g}$, these gradients are estimating functions in
the sense that the elements of $\mathbf{y}$ must be statistically independent when the gradient
becomes zero. Note also that in the form $E\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}(\mathbf{W}^{-1})^T$, the first factor
$E\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}$ has the shape of an optimal estimating function (except for the diagonal
elements); see Eq. (12.15). Now we also know how the nonlinear function $\mathbf{g}$
can be determined: it is directly the gradient of the function $G$ appearing in the
original cost function.
Unfortunately, the matrix inversion $(\mathbf{W}^{-1})^T$ in (12.19) is cumbersome. Matrix
inversion can be avoided by using the so-called natural gradient introduced by Amari
[4]. This is covered in Chapter 3. The natural gradient is obtained in this case by
multiplying the usual matrix gradient (12.19) from the right by the matrix $\mathbf{W}^T\mathbf{W}$, which
gives $E\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}\mathbf{W}$. The ensuing stochastic gradient algorithm to minimize the cost
function is then

$$\Delta\mathbf{W} = -\mu\,\mathbf{g}(\mathbf{y})\mathbf{y}^T\mathbf{W} \qquad (12.20)$$

This learning rule again has the form of nonlinear decorrelations. Omitting the
diagonal elements of the matrix $\mathbf{g}(\mathbf{y})\mathbf{y}^T$, the off-diagonal elements have the same
form as in the Cichocki-Unbehauen algorithm (12.11), with the two functions now
given by the linear function $g(y) = y$ and the gradient $\mathbf{g}$.

This gradient algorithm can also be derived using the relative gradient introduced
by Cardoso and Hvam Laheld [71]. This approach is also reviewed in Chapter
3. Based on this, the authors developed their equivariant adaptive separation via
independence (EASI) learning algorithm. To proceed from (12.20) to the EASI
learning rule, an extra step must be taken. In EASI, as in many other learning
rules for ICA, a whitening preprocessing is considered for the mixture vectors $\mathbf{x}$
(see Chapter 6). We first transform $\mathbf{x}$ linearly to $\mathbf{z} = \mathbf{V}\mathbf{x}$, whose elements have
unit variances and zero covariances: $E\{\mathbf{z}\mathbf{z}^T\} = \mathbf{I}$. As also shown in Chapter 6, an
appropriate adaptation rule for whitening is

$$\Delta\mathbf{V} = \mu[\mathbf{I} - \mathbf{z}\mathbf{z}^T]\mathbf{V} \qquad (12.21)$$

The ICA model using these whitened vectors instead of the original ones becomes
$\mathbf{z} = \mathbf{V}\mathbf{A}\mathbf{s}$, and it is easily seen that the matrix $\mathbf{V}\mathbf{A}$ is an orthogonal matrix (a rotation).
Thus its inverse, which gives the separating matrix, is also orthogonal. As in earlier
chapters, let us denote the orthogonal separating matrix by $\mathbf{W}$.

Basically, the learning rule for $\mathbf{W}$ would be the same as (12.20). However, as
noted by [71], certain constraints must hold in any updating of $\mathbf{W}$ if the orthogonality
is to be preserved at each iteration step. Let us denote the serial update for $\mathbf{W}$ using
the learning rule (12.20), briefly, as $\mathbf{W} \leftarrow \mathbf{W} + \mu\mathbf{H}\mathbf{W}$, where now $\mathbf{H} = -\mathbf{g}(\mathbf{y})\mathbf{y}^T$ and $\mathbf{y} = \mathbf{W}\mathbf{z}$.
The orthogonality condition for the updated matrix becomes

$$(\mathbf{W} + \mu\mathbf{H}\mathbf{W})(\mathbf{W} + \mu\mathbf{H}\mathbf{W})^T = \mathbf{I}$$

where $\mathbf{W}\mathbf{W}^T = \mathbf{I}$ has been substituted. Assuming $\mu$ small, the first-order approximation
gives the condition that $\mathbf{H} + \mathbf{H}^T = 0$, or $\mathbf{H}$ must be skew-symmetric. Applying
this condition to the relative gradient learning rule (12.20) for $\mathbf{W}$, we have

$$\Delta\mathbf{W} = -\mu[\mathbf{g}(\mathbf{y})\mathbf{y}^T - \mathbf{y}\mathbf{g}(\mathbf{y})^T]\mathbf{W} \qquad (12.22)$$

where now $\mathbf{y} = \mathbf{W}\mathbf{z}$. Contrary to the learning rule (12.20), this learning rule also
takes care of the diagonal elements of $\mathbf{g}(\mathbf{y})\mathbf{y}^T$ in a natural way, without imposing
any conditions on them.

What is left now is to combine the two learning rules (12.21) and (12.22) into
just one learning rule for the global system separation matrix. Because $\mathbf{y} = \mathbf{W}\mathbf{z} = \mathbf{W}\mathbf{V}\mathbf{x}$,
this global separation matrix is $\mathbf{B} = \mathbf{W}\mathbf{V}$. Assuming the same learning rates
for the two algorithms, a first-order approximation gives

$$\Delta\mathbf{B} = -\mu[\mathbf{y}\mathbf{y}^T - \mathbf{I} + \mathbf{g}(\mathbf{y})\mathbf{y}^T - \mathbf{y}\mathbf{g}(\mathbf{y})^T]\mathbf{B} \qquad (12.23)$$

This is the EASI algorithm. It has the nice feature of combining both whitening
and separation into a single algorithm. A convergence analysis as well as some
experimental results are given in [71]. One can easily see the close connection to the
nonlinear decorrelation algorithms introduced earlier.
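The following sketch runs a one-sample EASI update of the form (12.23) on two artificially mixed Laplacian sources (the mixing matrix, the nonlinearity $g = \tanh$, the learning rate, and the sample size are arbitrary illustrative choices; for super-gaussian sources this choice of $g$ gives a stable separating solution):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

# Two independent super-gaussian (Laplacian) unit-variance sources.
s = rng.laplace(0.0, 1.0 / np.sqrt(2.0), (2, n))
A = np.array([[0.9, 0.5],
              [0.3, 1.1]])
x = A @ s

# EASI: B <- B - mu [y y^T - I + g(y) y^T - y g(y)^T] B, one sample at a time.
B = np.eye(2)
I2 = np.eye(2)
mu = 0.002
for t in range(n):
    y = B @ x[:, [t]]          # keep y as a column vector
    g = np.tanh(y)
    B -= mu * (y @ y.T - I2 + g @ y.T - y @ g.T) @ B

# The global system matrix B A should be close to a (scaled, signed) permutation.
P = B @ A
print(np.round(P, 2))
```

Note the design choice embodied in the update: the symmetric part $\mathbf{y}\mathbf{y}^T - \mathbf{I}$ enforces whitening of the outputs, while the skew-symmetric part $\mathbf{g}(\mathbf{y})\mathbf{y}^T - \mathbf{y}\mathbf{g}(\mathbf{y})^T$ rotates them toward independence, so no separate whitening stage is needed.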
The concept of equivariance that forms part of the name of the EASI algorithm
is a general concept in statistical estimation; see, e.g., [395]. Equivariance of an
estimator means, roughly, that its performance does not depend on the actual value of
the parameter. In the context of the basic ICA model, this means that the ICs can be
estimated with the same performance whatever the mixing matrix may be. EASI was
one of the first ICA algorithms that was explicitly shown to be equivariant. In fact,
most estimators of the basic ICA model are equivariant. For a detailed discussion,
see [69].
12.6 NONLINEAR PRINCIPAL COMPONENTS

One of the basic definitions of PCA was optimal least mean-square error compression
of the input data, as explained in more detail in Chapter 6. Assuming a random
$m$-dimensional zero-mean vector $\mathbf{x}$, a solution (although not the unique one) of this
optimization problem is given by the eigenvectors $\mathbf{e}_1, \ldots, \mathbf{e}_n$ of the data covariance
matrix $\mathbf{C}_x = E\{\mathbf{x}\mathbf{x}^T\}$, and the linear factors $\mathbf{w}_i^T\mathbf{x}$ in the compression sum become
the principal components. For instance, if $\mathbf{x}$ is two-dimensional with a gaussian
density, and we seek a one-dimensional subspace (a straight line passing through the
center of the density), then the solution is [...]

In the nonlinear generalization, the compression criterion takes the form

$$\mathcal{J}(\mathbf{w}_1, \ldots, \mathbf{w}_n) = E\Big\{\big\|\mathbf{x} - \sum_{i=1}^{n} g_i(\mathbf{w}_i^T\mathbf{x})\,\mathbf{w}_i\big\|^2\Big\} \qquad (12.25)$$

so that the principal components are nonlinear functions of $\mathbf{x}$. In the optimal solution
that minimizes the criterion $\mathcal{J}(\mathbf{w}_1, \ldots, \mathbf{w}_n)$, such factors $g_i(\mathbf{w}_i^T\mathbf{x})$ might be termed
nonlinear principal components; therefore, the technique of finding the basis vectors
$\mathbf{w}_i$ is here called "nonlinear principal component analysis" (NLPCA). It should be
emphasized that practically always, when a well-defined linear problem is extended
into a nonlinear one, many ambiguities appear [...] It should also be noted that
minimizing the criterion (12.25) does not give a smaller least mean-square error than
standard PCA. Instead, the virtue of this criterion is that it introduces higher-order
statistics in a simple manner via the nonlinearities $g_i$.

For linear functions $g_i$, the optimal basis vectors are orthonormal. For nonlinear
functions $g_i(y)$, however, this is usually not true. Instead, in some cases at least, it
turns out that the optimal basis vectors $\mathbf{w}_i$ minimizing (12.25) will be aligned with
the independent components of the input vectors.

Example 12.2 Assume that $\mathbf{x}$ is a two-dimensional random vector that has a uniform
density in a square that is not aligned with the coordinate axes $x_1, x_2$ [...] The
covariance matrix of $\mathbf{x}$ is therefore equal to $\frac{1}{3}\mathbf{I}$. Thus, except for the scaling by
$\frac{1}{3}$, the vector $\mathbf{x}$ is whitened (sphered). However, the elements are not independent.
The problem is to find a rotation $\mathbf{s} = \mathbf{W}\mathbf{x}$ of $\mathbf{x}$ such that the elements of the rotated
vector $\mathbf{s}$ are statistically independent. It is obvious from Fig. 12.2 that the elements
of $\mathbf{s}$ must be aligned with the orientation of the square, because then and only then
are they independent; for the orthonormal basis vectors it must hold that
$\mathbf{w}_1^T\mathbf{w}_2 = 0$. The solution minimizing the criterion (12.25), with $\mathbf{w}_1, \mathbf{w}_2$ orthogonal
two-dimensional vectors and $g_1(\cdot) = g_2(\cdot) = g(\cdot)$ a suitable nonlinearity, now
provides a rotation into independent components. This can be seen as follows.
Assume that $g$ is a very sharp sigmoid, e.g., $g(y) = \tanh(10y)$, which is approximately
the sign function. The term $\sum_{i=1}^{2} g(\mathbf{w}_i^T\mathbf{x})\,\mathbf{w}_i$ in the criterion (12.25) then
becomes [...]

For other densities, the same effect of rotation into independent directions would
not be achieved. Certainly, this would not take place for gaussian densities with equal
variances, for which the criterion $\mathcal{J}(\mathbf{w}_1, \ldots, \mathbf{w}_n)$ would be independent of the
orientation. Whether the criterion results in independent components depends
strongly on the nonlinearities $g_i(y)$. A more detailed analysis of the criterion
(12.25) [...]

The goal in analyzing the learning rule (12.41), introduced later in the chapter, is
to show that, starting from some initial value, the weight matrix will tend to a
separating matrix. Assume that the ICA model holds, i.e., there exists an orthogonal
separating matrix $\mathbf{M}$ such that $\mathbf{s} = \mathbf{M}\mathbf{z}$, where the elements of $\mathbf{s}$ are statistically
independent. With whitening, the dimension of $\mathbf{z}$ has been reduced to that of $\mathbf{s}$;
thus both $\mathbf{M}$ and $\mathbf{W}$ are $n \times n$ matrices. To make the further analysis easier, we
proceed by making a linear transformation of the learning rule (12.41): we multiply
both sides by the orthogonal separating matrix [...] For the transformed rule, this
translates into the requirement that the weight matrix should tend to the unit matrix
or a permutation matrix; then the output would tend to the source vector $\mathbf{s}$, or a
permuted version of it, with independent components [...]
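Example 12.2 can be checked numerically. The sketch below is an illustration, not the book's own derivation: the sample size, the rotation angle, and the angle grid are arbitrary choices, and the square's coordinates are taken uniform on $[-1, 1]$ so that the covariance is $\frac{1}{3}\mathbf{I}$ as stated in the example. It evaluates the sample version of criterion (12.25) with $g(y) = \tanh(10y)$ over all rotations of an orthonormal basis and locates its minimum:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40_000

# Uniform density on a square rotated by alpha = 25 degrees.
alpha = np.deg2rad(25.0)
u = rng.uniform(-1.0, 1.0, (2, n))
R = np.array([[np.cos(alpha), -np.sin(alpha)],
              [np.sin(alpha),  np.cos(alpha)]])
x = R @ u

def J(theta):
    """Sample criterion (12.25) for an orthonormal basis rotated by theta."""
    W = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])   # columns w_1, w_2
    y = W.T @ x                                       # components w_i^T x
    resid = x - W @ np.tanh(10.0 * y)                 # x - sum_i g(w_i^T x) w_i
    return np.mean(np.sum(resid**2, axis=0))

thetas = np.deg2rad(np.arange(0.0, 90.0, 1.0))  # the criterion is 90-degree periodic
best = np.rad2deg(thetas[np.argmin([J(t) for t in thetas])])
print(best)
```

The minimizing angle should fall close to the square's orientation of 25 degrees, i.e., the optimal basis vectors align with the independent components, and in particular the criterion is lower at the aligned rotation than at the 45-degree-off rotation.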
the orientation. Whether the criterion results in independent components, depends
strongly on the nonlinearities . A more detailed analysis. second-order methods
such as whitening and principal component analysis (PCA). These methods give
components that are uncorrelated linear combinations