10 ICA by Minimization of Mutual Information

An important approach for independent component analysis (ICA) estimation, inspired by information theory, is minimization of mutual information. The motivation of this approach is that it may not be very realistic in many cases to assume that the data follows the ICA model. Therefore, we would like to develop an approach that does not assume anything about the data. What we want is a general-purpose measure of the dependence of the components of a random vector. Using such a measure, we could define ICA as a linear decomposition that minimizes that dependence measure. Such an approach can be developed using mutual information, which is a well-motivated information-theoretic measure of statistical dependence.

One of the main utilities of mutual information is that it serves as a unifying framework for many estimation principles, in particular maximum likelihood (ML) estimation and maximization of nongaussianity. In particular, this approach gives a rigorous justification for the heuristic principle of nongaussianity.

10.1 DEFINING ICA BY MUTUAL INFORMATION

10.1.1 Information-theoretic concepts

The information-theoretic concepts needed in this chapter were explained in Chapter 5. Readers not familiar with information theory are advised to read that chapter before this one.

We recall here very briefly the basic definitions of information theory. The differential entropy H of a random vector y with density p(y) is defined as

H(y) = - \int p(y) \log p(y) \, dy    (10.1)

Entropy is closely related to the code length of the random vector. A normalized version of entropy is given by the negentropy J, which is defined as follows:

J(y) = H(y_{gauss}) - H(y)    (10.2)

where y_{gauss} is a gaussian random vector with the same covariance (or correlation) matrix as y. Negentropy is always nonnegative, and it is zero only for gaussian random vectors.

Mutual information I between m (scalar) random variables y_i, i = 1, ..., m, is defined as follows:

I(y_1, y_2, \ldots, y_m) = \sum_{i=1}^{m} H(y_i) - H(y)    (10.3)

10.1.2 Mutual information as a measure of dependence

We have seen earlier (Chapter 5) that mutual information is a natural measure of the dependence between random variables. It is always nonnegative, and zero if and only if the variables are statistically independent. Mutual information takes into account the whole dependence structure of the variables, and not just the covariance, which is all that principal component analysis (PCA) and related methods use.

Therefore, we can use mutual information as the criterion for finding the ICA representation. This approach is an alternative to the model estimation approach. We define the ICA of a random vector x as an invertible transformation

s = Bx    (10.4)

where the matrix B is determined so that the mutual information of the transformed components s_i is minimized. If the data follows the ICA model, this allows estimation of the data model. On the other hand, in this definition, we do not need to assume that the data follows the model. In any case, minimization of mutual information can be interpreted as giving the maximally independent components.
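As a small numerical illustration of this definition (not taken from the text: the data, the bin count, and the helper functions below are arbitrary choices), one can estimate the entropies in (10.3) with histogram densities and check that the mutual information is essentially zero for independent variables, but clearly positive for variables that are dependent yet uncorrelated, exactly the kind of structure that covariance-based methods such as PCA cannot detect.

```python
import numpy as np

def entropy_1d(x, bins=60):
    """Differential entropy of a 1-D sample from a histogram density estimate."""
    counts, edges = np.histogram(x, bins=bins)
    p = counts / counts.sum()                 # probability mass per bin
    nz = p > 0
    # density in bin k is p_k / width_k, so H ~ -sum_k p_k log(p_k / width_k)
    return -np.sum(p[nz] * np.log(p[nz] / np.diff(edges)[nz]))

def entropy_2d(y, bins=60):
    """Joint differential entropy of a 2-D sample (rows are observations)."""
    counts, xe, ye = np.histogram2d(y[:, 0], y[:, 1], bins=bins)
    area = np.diff(xe)[:, None] * np.diff(ye)[None, :]
    p = counts / counts.sum()
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz] / area[nz]))

def mutual_information(y, bins=60):
    """I(y1, y2) = H(y1) + H(y2) - H(y1, y2), cf. (10.3)."""
    return entropy_1d(y[:, 0], bins) + entropy_1d(y[:, 1], bins) - entropy_2d(y, bins)

rng = np.random.default_rng(0)
y1 = rng.standard_normal(50000)
y2_indep = rng.standard_normal(50000)                 # independent of y1
y2_dep = y1 ** 2 + 0.5 * rng.standard_normal(50000)   # uncorrelated with y1, but dependent

print(mutual_information(np.column_stack([y1, y2_indep])))  # close to zero
print(mutual_information(np.column_stack([y1, y2_dep])))    # clearly positive
print(np.corrcoef(y1, y2_dep)[0, 1])                        # yet the correlation is ~ 0
```

The small positive value obtained in the independent case is the finite-sample bias of the histogram estimator; it shrinks as the sample size grows.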
10.2 MUTUAL INFORMATION AND NONGAUSSIANITY

Using the formula for the differential entropy of a transformation, as given in (5.13) of Chapter 5, we obtain a corresponding result for mutual information. We have for an invertible linear transformation y = Bx:

I(y_1, y_2, \ldots, y_n) = \sum_i H(y_i) - H(x) - \log|\det B|    (10.5)

Now, let us consider what happens if we constrain the y_i to be uncorrelated and of unit variance. This means E\{yy^T\} = B E\{xx^T\} B^T = I, which implies

\det I = 1 = \det(B E\{xx^T\} B^T) = (\det B)(\det E\{xx^T\})(\det B^T)    (10.6)

and this implies that \det B must be constant, since \det E\{xx^T\} does not depend on B. Moreover, for y_i of unit variance, entropy and negentropy differ only by a constant and the sign, as can be seen from (10.2). Thus we obtain

I(y_1, y_2, \ldots, y_n) = \text{const.} - \sum_i J(y_i)    (10.7)

where the constant term does not depend on B. This shows the fundamental relation between negentropy and mutual information.

We see from (10.7) that finding an invertible linear transformation B that minimizes the mutual information is roughly equivalent to finding directions in which the negentropy is maximized. We have seen previously that negentropy is a measure of nongaussianity. Thus, (10.7) shows that ICA estimation by minimization of mutual information is equivalent to maximizing the sum of the nongaussianities of the estimates of the independent components, when the estimates are constrained to be uncorrelated. Thus, we see that the formulation of ICA as minimization of mutual information gives another rigorous justification of our more heuristically introduced idea of finding maximally nongaussian directions, as used in Chapter 8.

In practice, however, there are also some important differences between these two criteria.

1. Negentropy, and other measures of nongaussianity, enable the deflationary, i.e., one-by-one, estimation of the independent components, since we can look for the maxima of nongaussianity of a single projection b^T x. This is not possible with mutual information or most other criteria, like the likelihood.

2. A smaller difference is that when using nongaussianity, we force the estimates of the independent components to be uncorrelated. This is not necessary when using mutual information, because we could use the form in (10.5) directly, as will be seen in the next section. Thus the optimization space is slightly reduced when using nongaussianity.

10.3 MUTUAL INFORMATION AND LIKELIHOOD

Mutual information and likelihood are intimately connected. To see the connection between likelihood and mutual information, consider the expectation of the log-likelihood in (9.5):

\frac{1}{T} E\{\log L(B)\} = \sum_{i=1}^{n} E\{\log p_i(b_i^T x)\} + \log|\det B|    (10.8)

If the p_i were equal to the actual pdf's of b_i^T x, the first term would be equal to -\sum_i H(b_i^T x). Thus the likelihood would be equal, up to an additive constant given by the total entropy of x, to the negative of the mutual information as given in Eq. (10.5).

In practice, the connection may be just as strong, or even stronger. This is because in practice we do not know the distributions of the independent components that are needed in ML estimation. A reasonable approach would be to estimate the density of b_i^T x as part of the ML estimation method, and use this as an approximation of the density of s_i. This is what we did in Chapter 9. Then the p_i in this approximation of the likelihood are indeed equal to the actual pdf's of b_i^T x. Thus, the equivalence would really hold.
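This connection is easy to verify numerically. The sketch below (an illustration only; the data, the candidate matrix B, and the histogram-based density estimates are our own choices) estimates the marginal densities of b_i^T x by histograms, so that E\{\log p_i(b_i^T x)\} reproduces -H(b_i^T x) and the average log-likelihood (10.8) coincides with -I - H(x) computed from (10.5).

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.laplace(size=(20000, 2)) / np.sqrt(2)      # supergaussian sources, unit variance
x = s @ np.array([[1.0, 0.5], [0.3, 1.0]]).T       # observed mixtures
B = np.array([[0.9, -0.2], [0.1, 1.1]])            # some candidate separating matrix
y = x @ B.T                                        # y_i = b_i^T x

bins = 60
loglik_terms, marginal_entropies = [], []
for i in range(2):
    counts, edges = np.histogram(y[:, i], bins=bins)
    widths = np.diff(edges)
    p = counts / counts.sum()
    dens = p / widths                              # histogram estimate of the pdf p_i
    # evaluate log p_i at every sample point
    idx = np.clip(np.digitize(y[:, i], edges) - 1, 0, bins - 1)
    loglik_terms.append(np.mean(np.log(dens[idx])))
    nz = p > 0
    marginal_entropies.append(-np.sum(p[nz] * np.log(dens[nz])))

logdet = np.log(abs(np.linalg.det(B)))
loglik = sum(loglik_terms) + logdet                  # (1/T) E{log L(B)}, cf. (10.8)
neg_I_minus_Hx = -sum(marginal_entropies) + logdet   # -I(y_1, y_2) - H(x), cf. (10.5)
print(loglik, neg_I_minus_Hx)                        # the two numbers agree
```

The agreement is exact by construction here; the practical point is that once the p_i are estimated from the data, maximizing the likelihood and minimizing the mutual information become the same optimization problem up to the constant H(x).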
Conversely, to approximate the mutual information, we could take a fixed approximation of the densities of the y_i and plug it into the definition of entropy. Denote the logarithms of these density approximations by G_i, that is, G_i(y_i) = \log p_i(y_i). Then we could approximate (10.5) as

I(y_1, y_2, \ldots, y_n) \approx - \sum_i E\{G_i(y_i)\} - \log|\det B| - H(x)    (10.9)

Now we see that this approximation is equal to the approximation of the likelihood used in Chapter 9 (except, again, for the global sign and the additive constant given by H(x)). This also gives an alternative method of approximating mutual information that is different from the approximation based on negentropy.

10.4 ALGORITHMS FOR MINIMIZATION OF MUTUAL INFORMATION

To use mutual information in practice, we need some method of estimating or approximating it from real data. Earlier, we saw two methods for approximating mutual information. The first one was based on the negentropy approximations introduced in Section 5.6. The second one was based on using more or less fixed approximations of the densities of the ICs, as in Chapter 9.

Thus, using mutual information leads essentially to the same algorithms as used for maximization of nongaussianity in Chapter 8, or for maximum likelihood estimation in Chapter 9. In the case of maximization of nongaussianity, the corresponding algorithms are those that use symmetric orthogonalization, since we are maximizing the sum of nongaussianities, so that no order exists between the components. Thus, we do not present any new algorithms in this chapter; the reader is referred to the two preceding chapters.
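For concreteness, a minimal sketch of the kind of fixed-point iteration meant here is given below: a symmetric-orthogonalization FastICA update on whitened data, with the tanh nonlinearity corresponding to G_1 in (8.26). This is an illustrative sketch under our own choices (random initialization, synthetic data, hypothetical function names), not the exact implementation used for the figures in the next section.

```python
import numpy as np

def sym_orthogonalize(W):
    """Symmetric orthogonalization W <- (W W^T)^{-1/2} W."""
    d, E = np.linalg.eigh(W @ W.T)
    return E @ np.diag(1.0 / np.sqrt(d)) @ E.T @ W

def fastica_symmetric(z, n_iter=10, seed=0):
    """Fixed-point iteration on whitened data z (rows are samples), g = tanh.
    All rows of W are updated in parallel and then re-orthogonalized, so the
    estimated components stay uncorrelated, as required by (10.7)."""
    T, n = z.shape
    rng = np.random.default_rng(seed)
    W = sym_orthogonalize(rng.standard_normal((n, n)))
    for _ in range(n_iter):
        y = z @ W.T                                   # current estimates y = W z
        g = np.tanh(y)
        g_prime = 1.0 - g ** 2
        # w_i <- E{z g(w_i^T z)} - E{g'(w_i^T z)} w_i, for all rows i at once
        W = (g.T @ z) / T - np.diag(g_prime.mean(axis=0)) @ W
        W = sym_orthogonalize(W)
    return W

# synthetic test: two uniform (subgaussian) sources, mixed and then whitened
rng = np.random.default_rng(1)
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(10000, 2))
x = s @ np.array([[2.0, 1.0], [1.0, 1.0]]).T
d, E = np.linalg.eigh(np.cov(x.T))
V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T               # whitening matrix
z = x @ V.T
W = fastica_symmetric(z)
y = z @ W.T   # estimated independent components (up to permutation and sign)
```

Evaluating an approximation of the mutual information of y after each update is how convergence curves like those shown in Figs. 10.1 and 10.2 of the next section are obtained.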
10.5 EXAMPLES

Here we show the results of applying minimization of mutual information to the two mixtures introduced in Chapter 7. We use here the whitened mixtures and the FastICA algorithm (which is essentially identical whichever approximation of mutual information is used). For illustration purposes, the algorithm was always initialized so that W was the identity matrix. The function G was chosen as G_1 in (8.26).

First, we used the data consisting of two mixtures of two subgaussian (uniformly distributed) independent components. To demonstrate the convergence of the algorithm, the mutual information of the components at each iteration step is plotted in Fig. 10.1. This was obtained by the negentropy-based approximation. At convergence, after two iterations, the mutual information was practically equal to zero.

[Fig. 10.1 The convergence of FastICA for ICs with uniform distributions. The value of the mutual information is shown as a function of the iteration count.]

The corresponding results for two supergaussian independent components are shown in Fig. 10.2. Convergence was obtained after three iterations, after which the mutual information was practically zero.

[Fig. 10.2 The convergence of FastICA for ICs with supergaussian distributions. The value of the mutual information is shown as a function of the iteration count.]

10.6 CONCLUDING REMARKS AND REFERENCES

A rigorous approach to ICA that is different from the maximum likelihood approach is given by minimization of mutual information. Mutual information is a natural information-theoretic measure of dependence, and therefore it is natural to estimate the independent components by minimizing the mutual information of their estimates. Mutual information gives a rigorous justification of the principle of searching for maximally nongaussian directions, and in the end it turns out to be very similar to the likelihood as well. Mutual information can be approximated by the same methods by which negentropy is approximated; alternatively, it can be approximated in the same way as the likelihood. Therefore, we find here very much the same objective functions and algorithms as in maximization of nongaussianity and maximum likelihood. The same gradient and fixed-point algorithms can be used to optimize mutual information.

Estimation of ICA by minimization of mutual information was probably first proposed in [89], where an approximation based on cumulants was derived. The idea has, however, a longer history in the context of neural network research, where it has been proposed as a sensory coding strategy. It was proposed in [26, 28, 30, 18] that decomposing sensory data into features that are maximally independent is useful as a preprocessing step. Our approach follows that of [197] for the negentropy approximations. A nonparametric algorithm for minimization of mutual information was proposed in [175], and an approach based on order statistics was proposed in [369]. See [322, 468] for a detailed analysis of the connection between mutual information and infomax or maximum likelihood. A more general framework was proposed in [377].

Problems

10.1 Derive the formula in (10.5).

10.2 Compute the constant in (10.7).

10.3 If the variances of the y_i are not constrained to unity, does this constant change?

10.4 Compute the mutual information for a gaussian random vector with covariance matrix C.

Computer assignments

10.1 Create a sample of 2-D gaussian data with the two covariance matrices

\begin{pmatrix} 3 & 0 \\ 0 & 2 \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}    (10.10)

Estimate numerically the mutual information using the definition. (Divide the data into bins, i.e., boxes of fixed size, and estimate the density in each bin by computing the proportion of data points that fall into that bin and dividing it by the size of the bin. This elementary density approximation can then be used in the definition.)
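One possible way to carry out this computer assignment (a sketch only; the sample size, number of bins, and function names are our own choices) is to build the histogram density estimate described above and plug it into the definition (10.3):

```python
import numpy as np

def entropy_hist(sample, bins):
    """Differential entropy of a d-dimensional sample from a histogram estimate."""
    counts, edges = np.histogramdd(sample, bins=bins)
    vol = np.ones_like(counts)                # volume of each bin
    for axis, e in enumerate(edges):
        shape = [1] * counts.ndim
        shape[axis] = -1
        vol = vol * np.diff(e).reshape(shape)
    p = counts / counts.sum()                 # proportion of points in each bin
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz] / vol[nz]))

def mutual_information(sample, bins=40):
    """I = sum of marginal entropies minus the joint entropy, cf. (10.3)."""
    marginals = sum(entropy_hist(sample[:, [i]], bins) for i in range(sample.shape[1]))
    return marginals - entropy_hist(sample, bins)

rng = np.random.default_rng(0)
for C in (np.array([[3.0, 0.0], [0.0, 2.0]]), np.array([[3.0, 1.0], [1.0, 2.0]])):
    x = rng.multivariate_normal(np.zeros(2), C, size=100000)
    print(mutual_information(x))
# the diagonal covariance gives a value near zero (the components are independent);
# the second covariance gives a clearly positive value
```

Problem 10.4 gives the closed-form value against which these estimates can be compared.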