ICA by Minimization of
Mutual Information
An important approach for independent component analysis (ICA) estimation,
inspired by information theory, is minimization of mutual information.
The motivation of this approach is that it may not be very realistic in many cases
to assume that the data follows the ICA model. Therefore, we would like to develop
an approach that does not assume anything about the data. What we want to have
is a general-purpose measure of the dependence of the components of a random
vector. Using such a measure, we could define ICA as a linear decomposition that
minimizes that dependence measure. Such an approach can be developed using
mutual information, which is a well-motivated information-theoretic measure of
statistical dependence.
One of the main utilities of mutual information is that it serves as a unifying
framework for many estimation principles, in particular maximum likelihood (ML)
estimation and maximization of nongaussianity. Notably, this approach gives a
rigorous justification for the heuristic principle of nongaussianity.
10.1.1 Information-theoretic concepts
The information-theoretic concepts needed in this chapter were explained in Chap-
ter 5. Readers not familiar with information theory are advised to read that chapter
before this one.
Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja
2001 John Wiley & Sons, Inc.
ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)
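We recall here very briefly the basic definitions of information theory. The
differential entropy $H$ of a random vector $\mathbf{y}$ with density $p(\mathbf{y})$ is defined as:
$$H(\mathbf{y}) = -\int p(\mathbf{y}) \log p(\mathbf{y}) \, d\mathbf{y} \tag{10.1}$$
Entropy is closely related to the code length of the random vector. A normalized
version of entropy is given by negentropy $J$, which is defined as follows:
$$J(\mathbf{y}) = H(\mathbf{y}_{\text{gauss}}) - H(\mathbf{y}) \tag{10.2}$$
where $\mathbf{y}_{\text{gauss}}$ is a gaussian random vector of the same covariance (or correlation)
matrix as $\mathbf{y}$. Negentropy is always nonnegative, and zero only for gaussian random
vectors. Mutual information $I$ between $n$ (scalar) random variables $y_i$, $i = 1, \ldots, n$, is
defined as follows:
$$I(y_1, y_2, \ldots, y_n) = \sum_{i=1}^{n} H(y_i) - H(\mathbf{y}) \tag{10.3}$$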
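As a small illustration of the negentropy definition (entropy of a gaussian of the same variance minus the entropy of the variable itself), the following sketch evaluates it in closed form for scalar densities of unit variance. The helper names are hypothetical and the three densities (uniform, Laplace, gaussian) are chosen only as examples; the entropy formulas are the standard closed-form expressions for these families, in nats.

```python
import math

def gaussian_entropy(variance: float) -> float:
    """Differential entropy (nats) of a scalar gaussian with the given variance."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)

def uniform_entropy(variance: float) -> float:
    """Differential entropy of a uniform density on an interval with the given variance."""
    width = math.sqrt(12.0 * variance)  # variance of a uniform density is width^2 / 12
    return math.log(width)

def laplace_entropy(variance: float) -> float:
    """Differential entropy of a Laplace density with the given variance."""
    scale = math.sqrt(variance / 2.0)  # variance of a Laplace density is 2 * scale^2
    return 1.0 + math.log(2.0 * scale)

def negentropy(entropy: float, variance: float) -> float:
    """Entropy of the gaussian with the same variance, minus the given entropy."""
    return gaussian_entropy(variance) - entropy

j_uniform = negentropy(uniform_entropy(1.0), 1.0)    # subgaussian example
j_laplace = negentropy(laplace_entropy(1.0), 1.0)    # supergaussian example
j_gauss = negentropy(gaussian_entropy(1.0), 1.0)     # zero by construction

print(f"uniform: {j_uniform:.4f}  laplace: {j_laplace:.4f}  gaussian: {j_gauss:.4f}")
```

Both nongaussian examples give a strictly positive value, and the gaussian gives zero, in line with the statement that negentropy is nonnegative and vanishes only in the gaussian case.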
10.1.2 Mutual information as measure of dependence
We have seen earlier (Chapter 5) that mutual information is a natural measure of the
dependence between random variables. It is always nonnegative, and zero if and only
if the variables are statistically independent. Mutual information takes into account
the whole dependence structure of the variables, and not just the covariance, as
principal component analysis (PCA) and related methods do.
Therefore, we can use mutual information as the criterion for finding the ICA
representation. This approach is an alternative to the model estimation approach. We
define the ICA of a random vector $\mathbf{x}$ as an invertible transformation:
$$\mathbf{s} = \mathbf{B}\mathbf{x} \tag{10.4}$$
where the matrix $\mathbf{B}$ is determined so that the mutual information of the transformed
components $s_i$ is minimized. If the data follows the ICA model, this allows estimation
of the data model. On the other hand, in this definition, we do not need to assume
that the data follows the model. In any case, minimization of mutual information can
be interpreted as giving the maximally independent components.
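Using the formula for the differential entropy of a transformation as given in (5.13)
of Chapter 5, we obtain a corresponding result for mutual information. We have
for an invertible linear transformation $\mathbf{y} = \mathbf{B}\mathbf{x}$:
$$I(y_1, y_2, \ldots, y_n) = \sum_{i} H(y_i) - H(\mathbf{x}) - \log|\det \mathbf{B}| \tag{10.5}$$
Now, let us consider what happens if we constrain the $y_i$ to be uncorrelated and of
unit variance. This means $E\{\mathbf{y}\mathbf{y}^T\} = \mathbf{B} E\{\mathbf{x}\mathbf{x}^T\} \mathbf{B}^T = \mathbf{I}$, which implies
$$\det \mathbf{I} = 1 = \det(\mathbf{B} E\{\mathbf{x}\mathbf{x}^T\} \mathbf{B}^T) = (\det \mathbf{B})(\det E\{\mathbf{x}\mathbf{x}^T\})(\det \mathbf{B}^T) \tag{10.6}$$
and this implies that $|\det \mathbf{B}|$ must be constant since $\det E\{\mathbf{x}\mathbf{x}^T\}$ does not depend
on $\mathbf{B}$. Moreover, for $y_i$ of unit variance, entropy and negentropy differ only by a
constant and the sign, as can be seen in (10.2). Thus we obtain
$$I(y_1, y_2, \ldots, y_n) = \text{const.} - \sum_{i} J(y_i) \tag{10.7}$$
where the constant term does not depend on $\mathbf{B}$. This shows the fundamental relation
between negentropy and mutual information.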
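The determinant argument above can be checked numerically. The sketch below assumes a hypothetical 2 x 2 covariance matrix for the data, builds several matrices that all make the transformed components uncorrelated with unit variance (an orthogonal matrix times a whitening matrix), and confirms that the logarithm of the absolute determinant is the same for all of them. The names `C`, `V`, and `random_orthogonal` are illustrative and not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical covariance matrix E{x x^T}, chosen only for illustration.
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])

# Whitening matrix V = C^{-1/2}, computed from the eigendecomposition of C.
eigvals, E = np.linalg.eigh(C)
V = E @ np.diag(eigvals ** -0.5) @ E.T

def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR decomposition of a gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q

# Any B satisfying B C B^T = I can be written as (orthogonal matrix) @ V.
log_abs_dets = []
for _ in range(5):
    B = random_orthogonal(2, rng) @ V
    assert np.allclose(B @ C @ B.T, np.eye(2))   # uncorrelated, unit variance
    log_abs_dets.append(np.log(abs(np.linalg.det(B))))

# The common value is fixed by the covariance of the data alone.
expected = -0.5 * np.log(np.linalg.det(C))
print(log_abs_dets, expected)
```

Since the determinant term is the same for every admissible matrix, only the sum of marginal entropies varies under the constraint, which is what allows mutual information to be rewritten in terms of negentropies.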
We see in (10.7) that finding an invertible linear transformation that minimizes
the mutual information is roughly equivalent to finding directions in which the
negentropy is maximized. We have seen previously that negentropy is a measure of
nongaussianity. Thus, (10.7) shows that ICA estimation by minimization of mutual
information is equivalent to maximizing the sum of nongaussianities of the estimates of
the independent components, when the estimates are constrained to be uncorrelated.
Thus, we see that the formulation of ICA as minimization of mutual information
gives another rigorous justification of our more heuristically introduced idea of finding
maximally nongaussian directions, as used in Chapter 8.
In practice, however, there are also some important differences between these two
criteria:
1. Negentropy, and other measures of nongaussianity, enable the deflationary, i.e.,
one-by-one, estimation of the independent components, since we can look for
the maxima of nongaussianity of a single projection $\mathbf{b}^T\mathbf{x}$. This is not possible
with mutual information or most other criteria, like the likelihood.
2. A smaller difference is that in using nongaussianity, we force the estimates of
the independent components to be uncorrelated. This is not necessary when
using mutual information, because we could use the form in (10.5) directly,
as will be seen in the next section. Thus the optimization space is slightly
reduced when nongaussianity is used.
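Mutual information and likelihood are intimately connected. To see the connection
between likelihood and mutual information, consider the expectation of the
log-likelihood in (9.5):
$$\frac{1}{T} E\{\log L(\mathbf{B})\} = \sum_{i=1}^{n} E\{\log p_i(\mathbf{b}_i^T\mathbf{x})\} + \log|\det \mathbf{B}| \tag{10.8}$$
If the $p_i$ were equal to the actual pdf's of $\mathbf{b}_i^T\mathbf{x}$, the first term would be equal to
$-\sum_i H(\mathbf{b}_i^T\mathbf{x})$. Thus the likelihood would be equal, up to an additive constant given
by the total entropy of $\mathbf{x}$, to the negative of mutual information as given in Eq. (10.5).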
In practice, the connection may be just as strong, or even stronger. This is because
in practice we do not know the distributions of the independent components that are
needed in ML estimation. A reasonable approach would be to estimate the density
of $\mathbf{b}_i^T\mathbf{x}$ as part of the ML estimation method, and use this as an approximation of the
density of $s_i$. This is what we did in Chapter 9. Then, the $p_i$ in this approximation
of likelihood are indeed equal to the actual pdf's of $\mathbf{b}_i^T\mathbf{x}$. Thus, the equivalency would
really hold.
Conversely, to approximate mutual information, we could take a fixed approximation
of the densities of the $y_i$, and plug this in the definition of entropy. Denote the
log-pdf's by $G_i(y_i) = \log p_i(y_i)$. Then we could approximate (10.5) as
$$I(y_1, y_2, \ldots, y_n) = -\sum_{i} E\{G_i(y_i)\} - \log|\det \mathbf{B}| - H(\mathbf{x}) \tag{10.9}$$
Now we see that this approximation is equal to the approximation of the likelihood
used in Chapter 9 (except, again, for the global sign and the additive constant given by
$H(\mathbf{x})$). This also gives an alternative method of approximating mutual information
that is different from the approximation that uses the negentropy approximations.
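The relation between the two approximations can be illustrated with a short numerical sketch. It assumes one fixed log-density for all components, here the logarithm of $p(y) = 1/(\pi \cosh y)$, which is only one possible choice of supergaussian model and is not prescribed by the text. The entropy of the data is unknown and is simply dropped (set to zero), since it does not depend on the matrix. The sources, the mixing matrix, and the function names are hypothetical.

```python
import numpy as np

def log_pdf(y):
    """G(y) = log p(y) for the fixed model density p(y) = 1 / (pi * cosh(y))."""
    # log cosh(y) = logaddexp(y, -y) - log 2, evaluated in a numerically stable way
    return np.log(2.0) - np.log(np.pi) - np.logaddexp(y, -y)

def mean_log_likelihood(B, X):
    """Sample version of the expected log-likelihood: sum_i E{G(y_i)} + log|det B|."""
    Y = B @ X
    return log_pdf(Y).mean(axis=1).sum() + np.log(abs(np.linalg.det(B)))

def approx_mutual_information(B, X, entropy_x=0.0):
    """Sample version of -sum_i E{G(y_i)} - log|det B| - H(x); H(x) defaults to 0."""
    Y = B @ X
    return -log_pdf(Y).mean(axis=1).sum() - np.log(abs(np.linalg.det(B))) - entropy_x

rng = np.random.default_rng(0)
T = 50_000
# Two independent Laplace (supergaussian) sources with unit variance.
S = rng.laplace(scale=1.0 / np.sqrt(2.0), size=(2, T))
theta = np.pi / 4
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # hypothetical orthogonal mixing
X = A @ S

B_true = A.T          # exact unmixing matrix for this mixing
B_id = np.eye(2)      # leaves the mixtures as they are

mi_true = approx_mutual_information(B_true, X)
mi_id = approx_mutual_information(B_id, X)
print(mi_true, mi_id, -mean_log_likelihood(B_true, X))
```

With the constant dropped, the approximated mutual information is exactly the negative of the sample log-likelihood, so minimizing one is the same as maximizing the other.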
To use mutual information in practice, we need some method of estimating or
approximating it from real data. Earlier, we saw two methods for approximating mutual
information. The first one was based on the negentropy approximations introduced in
Section 5.6. The second one was based on using more or less fixed approximations
for the densities of the ICs in Chapter 9.
Thus, using mutual information leads essentially to the same algorithms as used for
maximization of nongaussianity in Chapter 8, or for maximum likelihood estimation
in Chapter 9. In the case of maximization of nongaussianity, the corresponding
algorithms are those that use symmetric orthogonalization, since we are maximizing
the sum of nongaussianities, so that no order exists between the components. Thus,
we do not present any new algorithms in this chapter; the reader is referred to the two
preceding chapters.
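For orientation, here is a minimal sketch of a fixed-point iteration with symmetric orthogonalization on whitened data, in the spirit of the FastICA algorithm referred to above. It is not the book's reference implementation: the nonlinearity (tanh, the derivative of log cosh), the stopping rule, the uniform test sources, and the mixing matrix `A` are all assumptions made for this example.

```python
import numpy as np

def whiten(X):
    """Center X (shape n x T) and return whitened data Z and whitening matrix V."""
    X = X - X.mean(axis=1, keepdims=True)
    cov = X @ X.T / X.shape[1]
    d, E = np.linalg.eigh(cov)
    V = E @ np.diag(d ** -0.5) @ E.T
    return V @ X, V

def sym_orthogonalize(W):
    """Symmetric orthogonalization W <- (W W^T)^{-1/2} W; no row is privileged."""
    d, E = np.linalg.eigh(W @ W.T)
    return E @ np.diag(d ** -0.5) @ E.T @ W

def fastica_symmetric(Z, n_iter=100, tol=1e-8):
    """Update all rows of W in parallel on whitened data Z, then orthogonalize."""
    n, T = Z.shape
    W = np.eye(n)                          # identity initialization
    for _ in range(n_iter):
        Y = W @ Z
        g = np.tanh(Y)                     # g = G' for G(y) = log cosh(y)
        g_prime = 1.0 - g ** 2
        # Row-wise fixed-point step: w <- E{z g(w^T z)} - E{g'(w^T z)} w
        W_new = g @ Z.T / T - np.diag(g_prime.mean(axis=1)) @ W
        W_new = sym_orthogonalize(W_new)
        # Rows are defined only up to sign, so compare absolute inner products.
        delta = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0))
        W = W_new
        if delta < tol:
            break
    return W

rng = np.random.default_rng(0)
T = 10_000
S = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(2, T))  # unit-variance uniform ICs
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                                  # hypothetical mixing matrix
X = A @ S

Z, V = whiten(X)
W = fastica_symmetric(Z)
P = np.abs(W @ V @ A)   # close to a permutation matrix if separation succeeded
print(np.round(P, 3))
```

Because the orthogonalization treats all rows alike, the components are estimated jointly and come out in no particular order, which matches the remark that no order exists between the components when the sum of nongaussianities is maximized.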
Fig. 10.1 The convergence of FastICA for ICs with uniform distributions. The value of
mutual information shown as a function of iteration count. (Horizontal axis: iteration
count; vertical axis: mutual information.)
Here we show the results of applying minimization of mutual information to the
two mixtures introduced in Chapter 7. We use here the whitened mixtures, and the
FastICA algorithm (which is essentially identical whichever approximation of mutual
information is used). For illustration purposes, the algorithm was always initialized
so that $\mathbf{B}$ was the identity matrix. The function $G$ was chosen as in (8.26).
First, we used the data consisting of two mixtures of two subgaussian (uniformly
distributed) independent components. To demonstrate the convergence of the al-
gorithm, the mutual information of the components at each iteration step is plotted
in Fig. 10.1. This was obtained by the negentropy-based approximation. At con-
vergence, after two iterations, mutual information was practically equal to zero.
The corresponding results for two supergaussian independent components are shown
in Fig. 10.2. Convergence was obtained after three iterations, after which mutual
information was practically zero.
Fig. 10.2 The convergence of FastICA for ICs with supergaussian distributions. The value
of mutual information shown as a function of iteration count. (Horizontal axis: iteration
count; vertical axis: mutual information.)
A rigorous approach to ICA that is different from the maximum likelihood approach
is given by minimization of mutual information. Mutual information is a natural
information-theoretic measure of dependence, and therefore it is natural to estimate
the independent components by minimizing the mutual information of their estimates.
Mutual information gives a rigorous justification of the principle of searching for
maximally nongaussian directions, and in the end turns out to be very similar to the
likelihood as well.
Mutual information can be approximated by the same methods as negentropy.
Alternatively, it can be approximated in the same way as likelihood.
Therefore, we find here very much the same objective functions and algorithms as
in maximization of nongaussianity and maximum likelihood. The same gradient and
fixed-point algorithms can be used to optimize mutual information.
Estimation of ICA by minimization of mutual information was probably first
proposed in [89], where an approximation based on cumulants was derived. The idea has,
however, a longer history in the context of neural network research, where it has
been proposed as a sensory coding strategy. It was proposed in [26, 28, 30, 18]
that decomposing sensory data into features that are maximally independent is useful
as a preprocessing step. Our approach follows that of [197] for the negentropy
approximations.
A nonparametric algorithm for minimization of mutual information was proposed
in [175], and an approach based on order statistics was proposed in [369]. See
[322, 468] for a detailed analysis of the connection between mutual information and
infomax or maximum likelihood. A more general framework was proposed in [377].
10.1 Derive the formula in (10.5).
10.2 Compute the constant in (10.7).
10.3 If the variances of the $y_i$ are not constrained to unity, does this constant
change?
10.4 Compute the mutual information for a gaussian random vector with covariance
matrix $\mathbf{C}$.
Computer assignments
10.1 Create a sample of 2-D gaussian data with the two covariance matrices in (10.10).
Estimate numerically the mutual information using the definition. (Divide the data
into bins, i.e., boxes of fixed size, and estimate the density at each bin by computing
the number of data points that belong to that bin and dividing it by the size of the bin.
This elementary density approximation can then be used in the definition.)
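One possible way to set up the binning procedure described in this assignment is sketched below. It uses equal-size bins, estimates each density as the fraction of points in a bin divided by the bin size (normalizing by the sample size is an added step so that the estimate integrates to one), and combines the marginal and joint entropy estimates as in the definition of mutual information. The two covariance matrices, the sample size, and the number of bins are hypothetical choices, not those of the assignment.

```python
import numpy as np

def binned_entropy(counts, bin_size):
    """Differential entropy estimate from histogram counts over equal-size bins."""
    p = counts / counts.sum()        # probability mass in each bin
    mask = p > 0                     # empty bins contribute nothing
    density = p[mask] / bin_size     # elementary density estimate per bin
    return -np.sum(p[mask] * np.log(density))

def binned_mutual_information(y1, y2, bins=30):
    """Estimate I(y1, y2) = H(y1) + H(y2) - H(y1, y2) from a 2-D histogram."""
    counts, edges1, edges2 = np.histogram2d(y1, y2, bins=bins)
    d1 = edges1[1] - edges1[0]       # bin width along the first coordinate
    d2 = edges2[1] - edges2[0]       # bin width along the second coordinate
    h1 = binned_entropy(counts.sum(axis=1), d1)
    h2 = binned_entropy(counts.sum(axis=0), d2)
    h_joint = binned_entropy(counts, d1 * d2)
    return h1 + h2 - h_joint

rng = np.random.default_rng(0)
N = 100_000
# Hypothetical covariance matrices: one with independent components, one correlated.
cov_indep = np.array([[1.0, 0.0], [0.0, 1.0]])
cov_corr = np.array([[1.0, 0.8], [0.8, 1.0]])

data_indep = rng.multivariate_normal(np.zeros(2), cov_indep, size=N)
data_corr = rng.multivariate_normal(np.zeros(2), cov_corr, size=N)

mi_indep = binned_mutual_information(data_indep[:, 0], data_indep[:, 1])
mi_corr = binned_mutual_information(data_corr[:, 0], data_corr[:, 1])
print(f"independent: {mi_indep:.4f}  correlated: {mi_corr:.4f}")
```

The estimate depends on the number of bins and the sample size: too few bins blur the dependence, while too many bins for the available data inflate the estimate, so it is worth repeating the experiment with several settings.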