Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja.
Copyright 2001 John Wiley & Sons, Inc. ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)

Principal Component Analysis and Whitening

Principal component analysis (PCA) and the closely related Karhunen-Loève transform, or the Hotelling transform, are classic techniques in statistical data analysis, feature extraction, and data compression, stemming from the early work of Pearson [364]. Given a set of multivariate measurements, the purpose is to find a smaller set of variables with less redundancy that would give as good a representation as possible. This goal is related to the goal of independent component analysis (ICA). However, in PCA the redundancy is measured by correlations between data elements, while in ICA the much richer concept of independence is used, and in ICA the reduction of the number of variables is given less emphasis. Using only the correlations as in PCA has the advantage that the analysis can be based on second-order statistics alone. In connection with ICA, PCA is a useful preprocessing step.

The basic PCA problem is outlined in this chapter. Both the closed-form solution and on-line learning algorithms for PCA are reviewed. Next, the related linear statistical technique of factor analysis is discussed. The chapter is concluded by presenting how data can be preprocessed by whitening, removing the effect of first- and second-order statistics, which is very helpful as the first step in ICA.

6.1 PRINCIPAL COMPONENTS

The starting point for PCA is a random vector x with n elements. There is available a sample x(1), ..., x(T) from this random vector. No explicit assumptions on the probability density of the vectors are made in PCA, as long as the first- and second-order statistics are known or can be estimated from the sample. Also, no generative model is assumed for vector x. Typically the elements of x are measurements like pixel gray levels or values of a signal at different time instants. It is essential in PCA that the elements are mutually correlated, and there is thus some redundancy in x, making compression possible. If the elements are independent, nothing can be achieved by PCA.

In the PCA transform, the vector x is first centered by subtracting its mean: x ← x − E{x}. The mean is in practice estimated from the available sample x(1), ..., x(T) (see Chapter 4). Let us assume in the following that the centering has been done and thus E{x} = 0. Next, x is linearly transformed to another vector y with m elements, m < n, so that the redundancy induced by the correlations is removed. This is done by finding a rotated orthogonal coordinate system such that the elements of x in the new coordinates become uncorrelated. At the same time, the variances of the projections of x on the new coordinate axes are maximized, so that the first axis corresponds to the maximal variance, the second axis corresponds to the maximal variance in the direction orthogonal to the first axis, and so on.

For instance, if x has a gaussian density that is constant over ellipsoidal surfaces in the n-dimensional space, then the rotated coordinate system coincides with the principal axes of the ellipsoid. A two-dimensional example is shown in Fig. 2.7 in Chapter 2. The principal components are then the projections of the data points on the two principal axes, e_1 and e_2. In addition to achieving uncorrelated components, the variances of the components (projections) will also be very different in most applications, with a considerable number of the variances so small that the corresponding components can be discarded altogether. Those components that are left constitute the vector y.
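To make the transform concrete, here is a minimal NumPy sketch (illustrative code, not part of the original text; the toy data, dimensions, and variable names are arbitrary choices) that centers a sample, estimates the covariance matrix, and rotates the data onto the eigenvector coordinate system, where the components come out uncorrelated with variances equal to the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: T samples of a correlated 3-dimensional random vector x.
T = 10_000
A = np.array([[2.0, 0.0, 0.0],
              [1.5, 1.0, 0.0],
              [0.5, 0.3, 0.2]])
X = rng.standard_normal((T, 3)) @ A.T        # rows are observations x(t)

# 1. Center the data: x <- x - E{x}.
X = X - X.mean(axis=0)

# 2. Estimate the covariance matrix C_x and solve its eigenproblem.
C = (X.T @ X) / T
d, E = np.linalg.eigh(C)                      # eigh returns ascending eigenvalues
d, E = d[::-1], E[:, ::-1]                    # reorder so that d1 >= d2 >= ... >= dn

# 3. The principal components are the projections y_i = e_i^T x.
Y = X @ E

# In the new coordinates the components are uncorrelated and their
# variances equal the eigenvalues d_i.
print(np.round(np.cov(Y, rowvar=False, bias=True), 3))
print(np.round(d, 3))
```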
As an example, take a set of 8 × 8 pixel windows from a digital image, an application that is considered in detail in Chapter 21. They are first transformed, e.g., using row-by-row scanning, into vectors x whose elements are the gray levels of the 64 pixels in the window. In real-time digital video transmission, it is essential to reduce this data as much as possible without losing too much of the visual quality, because the total amount of data is very large. Using PCA, a compressed representation vector y can be obtained from x, which can be stored or transmitted. Typically, y can have as few as 10 elements, and a good replica of the original 8 × 8 image window can still be reconstructed from it. This kind of compression is possible because neighboring elements of x, which are the gray levels of neighboring pixels in the digital image, are heavily correlated. These correlations are utilized by PCA, allowing almost the same information to be represented by a much smaller vector y. PCA is a linear technique, so computing y from x is computationally light, which makes real-time processing possible.

6.1.1 PCA by variance maximization

In mathematical terms, consider a linear combination

$$y_1 = \sum_{k=1}^{n} w_{k1} x_k = w_1^T x$$

of the elements x_1, ..., x_n of the vector x. The w_{11}, ..., w_{n1} are scalar coefficients or weights, elements of an n-dimensional vector w_1, and w_1^T denotes the transpose of w_1. The factor y_1 is called the first principal component of x if the variance of y_1 is maximally large. Because the variance depends on both the norm and the orientation of the weight vector w_1 and grows without limit as the norm grows, we impose the constraint that the norm of w_1 is constant, in practice equal to 1. Thus we look for a weight vector w_1 maximizing the PCA criterion

$$J_1^{\mathrm{PCA}}(w_1) = E\{y_1^2\} = E\{(w_1^T x)^2\} = w_1^T E\{x x^T\} w_1 = w_1^T C_x w_1 \qquad (6.1)$$

so that

$$\|w_1\| = 1. \qquad (6.2)$$

Here E{·} is the expectation over the (unknown) density of the input vector x, and the norm of w_1 is the usual Euclidean norm defined as

$$\|w_1\| = (w_1^T w_1)^{1/2} = \Big[\sum_{k=1}^{n} w_{k1}^2\Big]^{1/2}.$$

The matrix C_x in Eq. (6.1) is the n × n covariance matrix of x, given for the zero-mean vector x by the correlation matrix (see Chapter 4)

$$C_x = E\{x x^T\}. \qquad (6.3)$$

It is well known from basic linear algebra (see, e.g., [324, 112]) that the solution to the PCA problem is given in terms of the unit-length eigenvectors e_1, ..., e_n of the matrix C_x. The ordering of the eigenvectors is such that the corresponding eigenvalues d_1, ..., d_n satisfy d_1 ≥ d_2 ≥ ... ≥ d_n. The solution maximizing (6.1) is given by w_1 = e_1. Thus the first principal component of x is y_1 = e_1^T x.
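The claim that no other unit-norm direction beats e_1 is easy to check numerically. The sketch below is an illustration with made-up data, not code from the text: it evaluates the criterion (6.1) at the leading eigenvector and at a large number of random unit-norm vectors.

```python
import numpy as np

rng = np.random.default_rng(1)

# An arbitrary symmetric positive-definite covariance matrix C_x.
B = rng.standard_normal((5, 5))
C = B @ B.T

d, E = np.linalg.eigh(C)
e1, d1 = E[:, -1], d[-1]              # leading eigenvector and eigenvalue

def J1(w, C):
    """Variance criterion J1(w) = w^T C_x w for a unit-norm w, cf. (6.1)."""
    return w @ C @ w

# J1 at the leading eigenvector equals d1 ...
assert np.isclose(J1(e1, C), d1)

# ... and no random unit-norm direction does better.
W = rng.standard_normal((1000, 5))
W /= np.linalg.norm(W, axis=1, keepdims=True)
print(J1(e1, C), (W * (W @ C)).sum(axis=1).max())
```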
The criterion J_1^PCA in Eq. (6.1) can be generalized to m principal components, with m any number between 1 and n. Denoting the m-th (1 ≤ m ≤ n) principal component by y_m = w_m^T x, with w_m the corresponding unit-norm weight vector, the variance of y_m is now maximized under the constraint that y_m is uncorrelated with all the previously found principal components:

$$E\{y_m y_k\} = 0, \quad k < m. \qquad (6.4)$$

Note that the principal components y_m have zero means, because E{y_m} = w_m^T E{x} = 0. The condition (6.4) yields

$$E\{y_m y_k\} = E\{(w_m^T x)(w_k^T x)\} = w_m^T C_x w_k = 0. \qquad (6.5)$$

For the second principal component, we have the condition

$$w_2^T C_x w_1 = d_1 w_2^T e_1 = 0, \qquad (6.6)$$

because we already know that w_1 = e_1. We are thus looking for maximal variance E{y_2^2} = E{(w_2^T x)^2} in the subspace orthogonal to the first eigenvector of C_x. The solution is given by w_2 = e_2. Likewise, it follows recursively that w_k = e_k. Thus the k-th principal component is y_k = e_k^T x.

Exactly the same result for the w_i is obtained if the variances of the y_i are maximized under the constraint that the principal component vectors are orthonormal, i.e., w_i^T w_j = δ_ij. This is left as an exercise.

6.1.2 PCA by minimum mean-square error compression

In the preceding subsection, the principal components were defined as weighted sums of the elements of x with maximal variance, under the constraints that the weights are normalized and the principal components are uncorrelated with each other. It turns out that this is strongly related to minimum mean-square error compression of x, which is another way to pose the PCA problem. Let us search for a set of m orthonormal basis vectors, spanning an m-dimensional subspace, such that the mean-square error between x and its projection on the subspace is minimal. Denoting again the basis vectors by w_1, ..., w_m, for which we assume w_i^T w_j = δ_ij, the projection of x on the subspace spanned by them is ∑_{i=1}^m (w_i^T x) w_i. The mean-square error (MSE) criterion, to be minimized by the orthonormal basis w_1, ..., w_m, becomes

$$J_{\mathrm{MSE}}^{\mathrm{PCA}} = E\Big\{\Big\|x - \sum_{i=1}^{m} (w_i^T x)\, w_i\Big\|^2\Big\}. \qquad (6.7)$$

It is easy to show (see exercises) that, due to the orthogonality of the vectors w_i, this criterion can be further written as

$$J_{\mathrm{MSE}}^{\mathrm{PCA}} = E\{\|x\|^2\} - E\Big\{\sum_{j=1}^{m} (w_j^T x)^2\Big\} \qquad (6.8)$$

$$= \mathrm{trace}(C_x) - \sum_{j=1}^{m} w_j^T C_x w_j. \qquad (6.9)$$

It can be shown (see, e.g., [112]) that the minimum of (6.9) under the orthonormality condition on the w_i is given by any orthonormal basis of the PCA subspace spanned by the m first eigenvectors e_1, ..., e_m. However, the criterion does not specify the basis of this subspace at all: any orthonormal basis of the subspace will give the same optimal compression. While this ambiguity can be seen as a disadvantage, it should be noted that there may be some other criteria by which a certain basis in the PCA subspace is to be preferred over others. Independent component analysis is a prime example of methods in which PCA is a useful preprocessing step: once the vector x has been expressed in terms of the first m eigenvectors, a further rotation brings out the much more useful independent components.

It can also be shown [112] that the value of the minimum mean-square error of (6.7) is

$$J_{\mathrm{MSE}}^{\mathrm{PCA}} = \sum_{i=m+1}^{n} d_i, \qquad (6.10)$$

the sum of the eigenvalues corresponding to the discarded eigenvectors e_{m+1}, ..., e_n. If the orthonormality constraint is simply changed to

$$w_j^T w_k = \omega_k \delta_{jk}, \qquad (6.11)$$

where all the numbers ω_k are positive and different, then the mean-square error problem has a unique solution given by scaled eigenvectors [333].

6.1.3 Choosing the number of principal components

From the result that the principal component basis vectors w_i are eigenvectors e_i of C_x, it follows that

$$E\{y_m^2\} = E\{e_m^T x x^T e_m\} = e_m^T C_x e_m = d_m. \qquad (6.12)$$

The variances of the principal components are thus directly given by the eigenvalues of C_x. Note that, because the principal components have zero means, a small eigenvalue (a small variance) d_m indicates that the value of the corresponding principal component y_m is mostly close to zero.

An important application of PCA is data compression. The vectors x in the original data set (that have first been centered by subtracting the mean) are approximated by the truncated PCA expansion

$$\hat{x} = \sum_{i=1}^{m} y_i e_i. \qquad (6.13)$$

Then we know from (6.10) that the mean-square error E{‖x − x̂‖^2} is equal to ∑_{i=m+1}^n d_i. As the eigenvalues are all positive, the error decreases when more and more terms are included in (6.13), until it becomes zero when m = n and all the principal components are included.
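The error formula (6.10) can be verified directly on data: reconstruct a centered sample from its first m principal components as in (6.13) and compare the empirical mean-square error with the sum of the discarded eigenvalues. The following sketch does exactly that; the toy data, dimensions, and names are illustrative choices, not from the text.

```python
import numpy as np

rng = np.random.default_rng(2)

# Centered toy data with correlated elements.
T, n, m = 20_000, 6, 2
X = rng.standard_normal((T, n)) @ rng.standard_normal((n, n))
X -= X.mean(axis=0)

C = X.T @ X / T
d, E = np.linalg.eigh(C)
d, E = d[::-1], E[:, ::-1]                   # d1 >= ... >= dn

# Truncated PCA expansion (6.13): x_hat = sum_{i<=m} y_i e_i.
Y = X @ E[:, :m]
X_hat = Y @ E[:, :m].T

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse, d[m:].sum())                      # the two numbers agree, cf. (6.10)
```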
A very important practical problem is how to choose m in (6.13); this is a trade-off between the error and the amount of data needed for the expansion. Sometimes a rather small number of principal components is sufficient.

Fig. 6.1 Leftmost column: some digital images in a 32 × 32 grid. Second column: means of the samples. Remaining columns: reconstructions by PCA when 1, 2, 5, 16, 32, and 64 principal components were used in the expansion.

Example 6.1 In digital image processing, the amount of data is typically very large, and data compression is necessary for storage, transmission, and feature extraction. PCA is a simple and efficient method. Fig. 6.1 shows 10 handwritten characters that were represented as binary 32 × 32 matrices (left column) [183]. Such images, when scanned row by row, can be represented as 1024-dimensional vectors. For each of the 10 character classes, about 1700 handwritten samples were collected, and the sample means and covariance matrices were computed by standard estimation methods. The covariance matrices were 1024 × 1024 matrices. For each class, the first 64 principal component vectors, or eigenvectors of the covariance matrix, were computed. The second column in Fig. 6.1 shows the sample means, and the other columns show the reconstructions (6.13) for various values of m. In the reconstructions, the sample means have been added back to scale the images for visual display. Note how a relatively small percentage of the 1024 principal components produces reasonable reconstructions.

The result (6.12) can often be used in advance to determine the number of principal components m, if the eigenvalues are known. The eigenvalue sequence d_1, d_2, ..., d_n of a covariance matrix for real-world measurement data is usually sharply decreasing, and it is possible to set a limit below which the eigenvalues, and hence the principal components, are insignificantly small. This limit determines how many principal components are used.

Sometimes the threshold can be determined from prior information on the vectors x. For instance, assume that x obeys a signal-noise model

$$x = \sum_{i=1}^{m} a_i s_i + n, \qquad (6.14)$$

where m < n. There the a_i are some fixed vectors and the coefficients s_i are zero-mean, mutually uncorrelated random numbers. We can assume that their variances have been absorbed into the vectors a_i so that the s_i have unit variances. The term n is white noise, for which E{n n^T} = σ^2 I. The vectors a_i then span a subspace, called the signal subspace, that has lower dimensionality than the whole space of vectors x. The subspace orthogonal to the signal subspace is spanned by pure noise and is called the noise subspace. It is easy to show (see exercises) that in this case the covariance matrix of x has a special form:

$$C_x = \sum_{i=1}^{m} a_i a_i^T + \sigma^2 I. \qquad (6.15)$$

The eigenvalues of C_x are now the eigenvalues of ∑_{i=1}^m a_i a_i^T, each increased by the constant σ^2. But the matrix ∑_{i=1}^m a_i a_i^T has at most m nonzero eigenvalues, and these correspond to eigenvectors that span the signal subspace. When the eigenvalues of C_x are computed, the first m form a decreasing sequence and the rest are small constants, equal to σ^2:

$$d_1 > d_2 > \dots > d_m > d_{m+1} = d_{m+2} = \dots = d_n = \sigma^2.$$

It is usually possible to detect where the eigenvalues become constant, and putting a threshold at this index, m, cuts off the eigenvalues and eigenvectors corresponding to pure noise. Then only the signal part remains.
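A quick way to see this eigenvalue structure is to simulate the model (6.14) and look at the spectrum of the sample covariance matrix. In the sketch below the dimensions, the vectors a_i, and the noise level σ are arbitrary illustrative choices; only the procedure reflects the text.

```python
import numpy as np

rng = np.random.default_rng(3)

n, m, T, sigma = 10, 3, 50_000, 0.1

# Signal-noise model (6.14): x = sum_i a_i s_i + n, with unit-variance s_i.
A = rng.standard_normal((n, m))              # fixed vectors a_1, ..., a_m
S = rng.standard_normal((T, m))              # zero-mean, uncorrelated coefficients
N = sigma * rng.standard_normal((T, n))      # white noise, E{n n^T} = sigma^2 I
X = S @ A.T + N

C = X.T @ X / T
d = np.sort(np.linalg.eigvalsh(C))[::-1]

# The first m eigenvalues are large; the remaining n - m cluster around sigma^2.
print(np.round(d, 4))
print("noise floor sigma^2 =", sigma**2)
```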
A more disciplined approach to this problem was given by [453]; see also [231]. They give formulas for two well-known information-theoretic modeling criteria, Akaike's information criterion (AIC) and the minimum description length (MDL) criterion, as functions of the signal subspace dimension m. The criteria depend on the length T of the sample x(1), ..., x(T) and on the eigenvalues d_1, ..., d_n of the matrix C_x. Finding the minimum point of the criterion gives a good value for m.

6.1.4 Closed-form computation of PCA

To use the closed-form solution w_i = e_i given earlier for the PCA basis vectors, the eigenvectors of the covariance matrix C_x must be known. In the conventional use of PCA, there is a sufficiently large sample of vectors x available, from which the mean and the covariance matrix C_x can be estimated by standard methods (see Chapter 4). Solving the eigenvector-eigenvalue problem for C_x then gives the estimates of the eigenvectors e_i. There are several efficient numerical methods available for solving the eigenvectors, e.g., the QR algorithm with its variants [112, 153, 320].

However, it is not always feasible to solve the eigenvectors by standard numerical methods. In an on-line data compression application like image or speech coding, the data samples x(t) arrive at high speed, and it may not be possible to estimate the covariance matrix and solve the eigenvector-eigenvalue problem once and for all. One reason is computational: the eigenvector problem is numerically too demanding if the dimensionality n is large and the sampling rate is high. Another reason is that the covariance matrix C_x may not be stationary, due to fluctuating statistics in the sample sequence x(t), so the estimate would have to be incrementally updated. Therefore, the PCA solution is often replaced by suboptimal nonadaptive transformations like the discrete cosine transform [154].
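As a rough illustration of what such incremental updating can look like in practice, the sketch below maintains an exponentially weighted estimate of the covariance matrix from a stream of zero-mean samples and re-solves the eigenproblem only occasionally. This particular recursion, its forgetting factor, and the toy input stream are not taken from the text; they are only one plausible scheme, and the eigenproblem is still solved by a standard routine rather than by the learning rules of the next section.

```python
import numpy as np

rng = np.random.default_rng(4)
n, beta = 4, 0.995                   # beta: forgetting factor (arbitrary illustrative value)

C_est = np.eye(n)                    # running covariance estimate
A = rng.standard_normal((n, n))      # mixing matrix for the toy input stream

for t in range(20_000):
    x = A @ rng.standard_normal(n)   # new zero-mean sample x(t)
    # Exponentially weighted update: recent samples dominate, so the
    # estimate can follow slowly changing statistics.
    C_est = beta * C_est + (1 - beta) * np.outer(x, x)

    if t % 5_000 == 4_999:
        # Re-solve the eigenproblem only occasionally, not at every step.
        d, E = np.linalg.eigh(C_est)
        print(t, np.round(d[::-1], 2))
```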
6.2 PCA BY ON-LINE LEARNING

Another alternative is to derive gradient ascent algorithms or other on-line methods for the preceding maximization problems. The algorithms then converge to the solutions of the problems, that is, to the eigenvectors. The advantage of this approach is that such algorithms work on-line, using each input vector x(t) once as it becomes available and making an incremental change to the eigenvector estimates, without computing the covariance matrix at all. This approach is the basis of the PCA neural network learning rules.

Neural networks provide a novel way for parallel on-line computation of the PCA expansion. The PCA network [326] is a layer of parallel linear artificial neurons, shown in Fig. 6.2. The output of the i-th unit (i = 1, ..., m) is y_i = w_i^T x, with x denoting the n-dimensional input vector of the network and w_i denoting the weight vector of the i-th unit. The number of units, m, determines how many principal components the network will compute. Sometimes this can be determined in advance for typical inputs, or m can be set equal to n if all the principal components are required.

The PCA network learns the principal components by unsupervised learning rules, by which the weight vectors are gradually updated until they become orthonormal and tend to the theoretically correct eigenvectors. The network also has the ability to track slowly varying statistics in the input data, maintaining its optimality when the statistical properties of the inputs do not stay constant. Due to their parallelism and adaptivity to input data, such learning algorithms and their implementations in neural networks are potentially useful in feature detection and data compression tasks. In ICA, where decorrelating the mixture variables is a useful preprocessing step, these learning rules can be used in connection with on-line ICA.

Fig. 6.2 The basic linear PCA layer.

6.2.1 The stochastic gradient ascent algorithm

In this learning rule, the gradient of y_1^2 is taken with respect to w_1, and the normalizing constraint ‖w_1‖ = 1 is taken into account. The learning rule is

$$w_1(t+1) = w_1(t) + \gamma(t)\,[\,y_1(t)\, x(t) - y_1^2(t)\, w_1(t)\,]$$

with y_1(t) = w_1(t)^T x(t). This is iterated over the training set x(1), x(2), .... The parameter γ(t) is the learning rate controlling the speed of convergence. In this chapter we use the shorthand notation introduced in Chapter 3 and write the learning rule as

$$\Delta w_1 = \gamma\,(y_1 x - y_1^2 w_1). \qquad (6.16)$$

The name stochastic gradient ascent (SGA) is due to the fact that the gradient is taken not with respect to the variance E{y_1^2} but with respect to the instantaneous random value y_1^2. In this way, the gradient can be updated every time a new input vector becomes available, contrary to batch-mode learning. Mathematically, this is a stochastic approximation type of algorithm (for details, see Chapter 3). Convergence requires that the learning rate is decreased during learning at a suitable rate; for tracking nonstationary statistics, the learning rate should instead remain at a small constant value. For a derivation of this rule, as well as for the mathematical details of its convergence, see [323, 324, 330]. The algorithm (6.16) is often called Oja's rule in the literature.

Likewise, taking the gradient of y_j^2 with respect to the weight vector w_j and using the normalization and orthogonality constraints, we end up with the learning rule

$$\Delta w_j = \gamma\, y_j \Big[\, x - y_j w_j - 2 \sum_{i<j} y_i w_i \,\Big]$$
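The one-unit rule (6.16) can be tried out directly on synthetic zero-mean data. In the following sketch the data, the learning-rate schedule γ(t) = 1/(100 + t), and the convergence check against the leading eigenvector are illustrative choices, not specifications from the text; the update line itself is exactly (6.16).

```python
import numpy as np

rng = np.random.default_rng(5)

# Zero-mean data with a dominant principal direction: independent sources with
# decreasing standard deviations, rotated by a random orthogonal matrix Q.
n, T = 5, 50_000
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
X = (rng.standard_normal((T, n)) * np.array([2.0, 1.0, 0.7, 0.5, 0.3])) @ Q.T

w = rng.standard_normal(n)
w /= np.linalg.norm(w)

for t, x in enumerate(X):
    gamma = 1.0 / (100.0 + t)            # slowly decreasing learning rate
    y = w @ x                            # y1(t) = w1(t)^T x(t)
    w += gamma * (y * x - y**2 * w)      # SGA / Oja's rule, Eq. (6.16)

# The weight vector should align with the leading eigenvector of C_x,
# and its norm should stay close to 1.
C = X.T @ X / T
e1 = np.linalg.eigh(C)[1][:, -1]
print(abs(w @ e1), np.linalg.norm(w))    # both close to 1
```

Note that no covariance matrix is used inside the loop; it is estimated afterwards only to check the result, which is the point of the on-line approach described above.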