3 Gradients and Optimization Methods
The main task in the independent component analysis (ICA) problem, formulated in
Chapter 1, is to estimate a separating matrix $\mathbf{W}$ that will give us the independent
components. It also became clear that $\mathbf{W}$ cannot generally be solved in closed form,
that is, we cannot write it as some function of the sample or training set, whose value
could be directly evaluated. Instead, the solution method is based on cost functions,
also called objective functions or contrast functions. Solutions to ICA are found
at the minima or maxima of these functions. Several possible ICA cost functions will
be given and discussed in detail in Parts II and III of this book. In general, statistical
estimation is largely based on optimization of cost or objective functions, as will be
seen in Chapter 4.
Minimization of multivariate functions, possibly under some constraints on the
solutions, is the subject of optimization theory. In this chapter, we discuss some
typical iterative optimization algorithms and their properties. Mostly, the algorithms
are based on the gradients of the cost functions. Therefore, vector and matrix
gradients are reviewed first, followed by the most typical ways to solve unconstrained
and constrained optimization problems with gradient-type learning algorithms.
3.1 VECTOR AND MATRIX GRADIENTS
3.1.1 Vector gradient
Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja
Copyright 2001 John Wiley & Sons, Inc.
ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)
Consider a scalar-valued function $g$ of $m$ variables
$$g = g(w_1, \ldots, w_m) = g(\mathbf{w})$$
where we have used the notation $\mathbf{w} = (w_1, \ldots, w_m)^T$. By convention, we define $\mathbf{w}$
as a column vector. Assuming the function $g$ is differentiable, its vector gradient with
respect to $\mathbf{w}$ is the $m$-dimensional column vector of partial derivatives
$$\frac{\partial g}{\partial \mathbf{w}} = \begin{pmatrix} \frac{\partial g}{\partial w_1} \\ \vdots \\ \frac{\partial g}{\partial w_m} \end{pmatrix} \tag{3.1}$$
The notation $\frac{\partial g}{\partial \mathbf{w}}$ is just shorthand for the gradient; it should be understood that it
does not imply any kind of division by a vector, which is not a well-defined concept.
Another commonly used notation would be $\nabla g$ or $\nabla_{\mathbf{w}} g$.
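In some iteration methods, we also have reason to use second-order gradients. We
define the second-order gradient of a function $g$ with respect to $\mathbf{w}$ as
$$\frac{\partial^2 g}{\partial \mathbf{w}^2} = \begin{pmatrix} \frac{\partial^2 g}{\partial w_1^2} & \cdots & \frac{\partial^2 g}{\partial w_1 \partial w_m} \\ \vdots & & \vdots \\ \frac{\partial^2 g}{\partial w_m \partial w_1} & \cdots & \frac{\partial^2 g}{\partial w_m^2} \end{pmatrix} \tag{3.2}$$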
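This is an $m \times m$ matrix whose elements are second-order partial derivatives. It is
called the Hessian matrix of the function $g(\mathbf{w})$. It is symmetric whenever the second-order
partial derivatives are continuous, because the order of differentiation can then be interchanged.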
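Definitions (3.1) and (3.2) can be checked numerically. The following is a minimal sketch, assuming NumPy and central finite differences; the helper names `numerical_gradient` and `numerical_hessian` and the test function are hypothetical choices made for illustration, not part of the text.

```python
import numpy as np

def numerical_gradient(g, w, eps=1e-6):
    """Central-difference estimate of the gradient vector in (3.1)."""
    w = np.asarray(w, dtype=float)
    basis = eps * np.eye(w.size)
    return np.array([(g(w + e) - g(w - e)) / (2 * eps) for e in basis])

def numerical_hessian(g, w, eps=1e-4):
    """Central-difference estimate of the Hessian matrix in (3.2)."""
    w = np.asarray(w, dtype=float)
    m = w.size
    basis = eps * np.eye(m)
    hess = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            ei, ej = basis[i], basis[j]
            hess[i, j] = (g(w + ei + ej) - g(w + ei - ej)
                          - g(w - ei + ej) + g(w - ei - ej)) / (4 * eps ** 2)
    return hess

# Hypothetical test function: g(w) = w1^2 * w2 + sin(w2)
def g(w):
    return w[0] ** 2 * w[1] + np.sin(w[1])

w0 = np.array([1.0, 2.0])
grad = numerical_gradient(g, w0)   # analytic: (2*w1*w2, w1^2 + cos(w2))
hess = numerical_hessian(g, w0)    # analytic: [[2*w2, 2*w1], [2*w1, -sin(w2)]]
```

For this smooth test function the estimated Hessian agrees with its transpose up to rounding error, in line with the symmetry remark above.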
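These concepts generalize to vector-valued functions; this means an $n$-element
vector
$$\mathbf{g}(\mathbf{w}) = \begin{pmatrix} g_1(\mathbf{w}) \\ \vdots \\ g_n(\mathbf{w}) \end{pmatrix} \tag{3.3}$$
whose elements $g_i(\mathbf{w})$ are themselves functions of $\mathbf{w}$. The Jacobian matrix of $\mathbf{g}$ with
respect to $\mathbf{w}$ is
$$\frac{\partial \mathbf{g}}{\partial \mathbf{w}} = \begin{pmatrix} \frac{\partial g_1}{\partial w_1} & \cdots & \frac{\partial g_n}{\partial w_1} \\ \vdots & & \vdots \\ \frac{\partial g_1}{\partial w_m} & \cdots & \frac{\partial g_n}{\partial w_m} \end{pmatrix} \tag{3.4}$$
Thus the $i$th column of the Jacobian matrix is the gradient vector of $g_i(\mathbf{w})$ with
respect to $\mathbf{w}$. The Jacobian matrix is sometimes denoted by $J\mathbf{g}$.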
For computing the gradients of products and quotients of functions, as well as of
composite functions, the same rules apply as for ordinary functions of one variable.
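Thus
$$\frac{\partial [f(\mathbf{w}) g(\mathbf{w})]}{\partial \mathbf{w}} = \frac{\partial f(\mathbf{w})}{\partial \mathbf{w}} g(\mathbf{w}) + f(\mathbf{w}) \frac{\partial g(\mathbf{w})}{\partial \mathbf{w}} \tag{3.5}$$
$$\frac{\partial [f(\mathbf{w}) / g(\mathbf{w})]}{\partial \mathbf{w}} = \frac{\frac{\partial f(\mathbf{w})}{\partial \mathbf{w}} g(\mathbf{w}) - f(\mathbf{w}) \frac{\partial g(\mathbf{w})}{\partial \mathbf{w}}}{g^2(\mathbf{w})} \tag{3.6}$$
$$\frac{\partial f(g(\mathbf{w}))}{\partial \mathbf{w}} = f'(g(\mathbf{w})) \frac{\partial g(\mathbf{w})}{\partial \mathbf{w}} \tag{3.7}$$
The gradient of the composite function $f(g(\mathbf{w}))$ can be generalized to any number
of nested functions, giving the same chain rule of differentiation that is valid for
functions of one variable.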
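As a small worked instance of the chain rule (3.7), consider the Euclidean norm written as a composite function. The choice $g(\mathbf{w}) = \sum_i w_i^2$ and $f(u) = \sqrt{u}$ is an illustrative assumption, not an example taken from the text.

```latex
\[
g(\mathbf{w}) = \sum_{i=1}^{m} w_i^2
\quad\Rightarrow\quad
\frac{\partial g}{\partial \mathbf{w}} = 2\mathbf{w},
\qquad
f(u) = \sqrt{u}
\quad\Rightarrow\quad
f'(u) = \frac{1}{2\sqrt{u}},
\]
\[
\frac{\partial \|\mathbf{w}\|}{\partial \mathbf{w}}
= f'(g(\mathbf{w}))\,\frac{\partial g}{\partial \mathbf{w}}
= \frac{1}{2\|\mathbf{w}\|}\,2\mathbf{w}
= \frac{\mathbf{w}}{\|\mathbf{w}\|},
\qquad \mathbf{w} \neq \mathbf{0}.
\]
```

The gradient of the norm is thus the unit vector in the direction of $\mathbf{w}$, and it is undefined at the origin.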
3.1.2 Matrix gradient
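In many of the algorithms encountered in this book, we have to consider scalar-valued
functions $g$ of the elements of an $m \times n$ matrix $\mathbf{W} = (w_{ij})$:
$$g = g(\mathbf{W}) = g(w_{11}, \ldots, w_{ij}, \ldots, w_{mn}) \tag{3.8}$$
A typical function of this kind is the determinant of $\mathbf{W}$.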
Of course, any matrix can be trivially represented as a vector by scanning the
elements row by row into a vector and reindexing. Thus, when considering the
gradient of $g$ with respect to the matrix elements, it would suffice to use the notion
of vector gradient reviewed earlier. However, using the separate concept of matrix
gradient gives some advantages in terms of a simplified notation and sometimes
intuitively appealing results.
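In analogy with the vector gradient, the matrix gradient means a matrix of the
same size $m \times n$ as matrix $\mathbf{W}$, whose $ij$th element is the partial derivative of $g$ with
respect to $w_{ij}$. Formally we can write
$$\frac{\partial g}{\partial \mathbf{W}} = \begin{pmatrix} \frac{\partial g}{\partial w_{11}} & \cdots & \frac{\partial g}{\partial w_{1n}} \\ \vdots & & \vdots \\ \frac{\partial g}{\partial w_{m1}} & \cdots & \frac{\partial g}{\partial w_{mn}} \end{pmatrix} \tag{3.9}$$
Again, the notation $\frac{\partial g}{\partial \mathbf{W}}$ is just shorthand for the matrix gradient.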
Let us look next at some examples of vector and matrix gradients. The formulas
presented in these examples will be frequently needed later in this book.
3.1.3 Examples of gradients
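Example 3.1 Consider the simple linear functional of $\mathbf{w}$, or inner product
$$g(\mathbf{w}) = \sum_{i=1}^{m} a_i w_i = \mathbf{a}^T \mathbf{w}$$
where $\mathbf{a} = (a_1, \ldots, a_m)^T$ is a constant vector. The gradient is, according to (3.1),
$$\frac{\partial g}{\partial \mathbf{w}} = \begin{pmatrix} a_1 \\ \vdots \\ a_m \end{pmatrix} \tag{3.10}$$
which is the vector $\mathbf{a}$. We can write
$$\frac{\partial \mathbf{a}^T \mathbf{w}}{\partial \mathbf{w}} = \mathbf{a}$$
Because the gradient is constant (independent of $\mathbf{w}$), the Hessian matrix of
$g(\mathbf{w}) = \mathbf{a}^T \mathbf{w}$ is zero.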
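Example 3.2 Next consider the quadratic form
$$g(\mathbf{w}) = \mathbf{w}^T \mathbf{A} \mathbf{w} = \sum_{i=1}^{m} \sum_{j=1}^{m} w_i w_j a_{ij} \tag{3.11}$$
where $\mathbf{A} = (a_{ij})$ is a square $m \times m$ matrix. We have
$$\frac{\partial g}{\partial \mathbf{w}} = \begin{pmatrix} \sum_{j=1}^{m} w_j a_{1j} + \sum_{i=1}^{m} w_i a_{i1} \\ \vdots \\ \sum_{j=1}^{m} w_j a_{mj} + \sum_{i=1}^{m} w_i a_{im} \end{pmatrix} \tag{3.12}$$
which is equal to the vector $\mathbf{A}\mathbf{w} + \mathbf{A}^T\mathbf{w}$. So,
$$\frac{\partial \mathbf{w}^T \mathbf{A} \mathbf{w}}{\partial \mathbf{w}} = \mathbf{A}\mathbf{w} + \mathbf{A}^T\mathbf{w}$$
For symmetric $\mathbf{A}$, this becomes $2\mathbf{A}\mathbf{w}$.
The second-order gradient or Hessian becomes
$$\frac{\partial^2 \mathbf{w}^T \mathbf{A} \mathbf{w}}{\partial \mathbf{w}^2} = \begin{pmatrix} 2a_{11} & \cdots & a_{1m} + a_{m1} \\ \vdots & & \vdots \\ a_{m1} + a_{1m} & \cdots & 2a_{mm} \end{pmatrix} \tag{3.13}$$
which is equal to the matrix $\mathbf{A} + \mathbf{A}^T$. If $\mathbf{A}$ is symmetric, then the Hessian of
$\mathbf{w}^T \mathbf{A} \mathbf{w}$ is equal to $2\mathbf{A}$.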
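The result of Example 3.2 can be verified numerically. This is a minimal sketch, assuming NumPy and a randomly drawn, generally nonsymmetric matrix; the seed and dimension are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4
A = rng.standard_normal((m, m))   # arbitrary square matrix, generally nonsymmetric
w = rng.standard_normal(m)

def g(w):
    """Quadratic form (3.11)."""
    return w @ A @ w

analytic_grad = A @ w + A.T @ w   # gradient claimed in Example 3.2
analytic_hess = A + A.T           # Hessian claimed in (3.13)

# Central differences along each coordinate direction.
eps = 1e-6
numeric_grad = np.array([
    (g(w + eps * e) - g(w - eps * e)) / (2 * eps)
    for e in np.eye(m)
])

# For the symmetrized matrix S, the gradient of w^T S w reduces to 2 S w.
S = (A + A.T) / 2
symmetric_grad = 2 * S @ w
```

Because $\mathbf{w}^T \mathbf{A} \mathbf{w} = \mathbf{w}^T \mathbf{S} \mathbf{w}$ for the symmetric part $\mathbf{S}$ of $\mathbf{A}$, `symmetric_grad` coincides with `analytic_grad`.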
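Example 3.3 For the quadratic form (3.11), we might quite as well take the gradient
with respect to $\mathbf{A}$, assuming now that $\mathbf{w}$ is a constant vector. Then $\frac{\partial \mathbf{w}^T \mathbf{A} \mathbf{w}}{\partial a_{ij}} = w_i w_j$.
Compiling this into matrix form, we notice that the matrix gradient is the $m \times m$
matrix $\mathbf{w}\mathbf{w}^T$.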
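The matrix gradient of Example 3.3 can be checked element by element, following definition (3.9). A minimal sketch, assuming NumPy; the seed and the dimension are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 3
A = rng.standard_normal((m, m))
w = rng.standard_normal(m)        # held constant, as in Example 3.3

def g(A):
    """Quadratic form (3.11), now viewed as a function of the matrix A."""
    return w @ A @ w

# Perturb one matrix element at a time to estimate each partial derivative.
eps = 1e-6
numeric = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        E = np.zeros((m, m))
        E[i, j] = eps
        numeric[i, j] = (g(A + E) - g(A - E)) / (2 * eps)

analytic = np.outer(w, w)         # the matrix w w^T
```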
Example 3.4 In some ICA models, we must compute the matrix gradient of the
determinant of a matrix. The determinant is a scalar function of the matrix elements
consisting of multiplications and summations, and therefore its partial derivatives are
relatively simple to compute. Let us prove the following: If $\mathbf{W}$ is an invertible square
$m \times m$ matrix whose determinant is denoted $\det \mathbf{W}$, then
$$\frac{\partial}{\partial \mathbf{W}} \det \mathbf{W} = (\mathbf{W}^T)^{-1} \det \mathbf{W} \tag{3.14}$$
This is a good example for showing that a compact formula is obtained using the
matrix gradient; if $\mathbf{W}$ were stacked into a long vector, and only the vector gradient
were used, this result could not be expressed so simply.
Instead of starting from scratch, we employ a well-known result from matrix
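algebra (see, e.g., [159]), stating that the inverse of a matrix $\mathbf{W}$ is obtained as
$$\mathbf{W}^{-1} = \frac{1}{\det \mathbf{W}} \operatorname{adj}(\mathbf{W}) \tag{3.15}$$
with $\operatorname{adj}(\mathbf{W})$ the so-called adjoint of $\mathbf{W}$. The adjoint is the matrix
$$\operatorname{adj}(\mathbf{W}) = \begin{pmatrix} W_{11} & \cdots & W_{m1} \\ \vdots & & \vdots \\ W_{1m} & \cdots & W_{mm} \end{pmatrix} \tag{3.16}$$
where the scalar numbers $W_{ij}$ are the so-called cofactors. The cofactor $W_{ij}$ is
obtained by first taking the $(m-1) \times (m-1)$ submatrix of $\mathbf{W}$ that remains when
the $i$th row and $j$th column are removed, then computing the determinant of this
submatrix, and finally multiplying by $(-1)^{i+j}$.
The determinant $\det \mathbf{W}$ can also be expressed in terms of the cofactors:
$$\det \mathbf{W} = \sum_{k=1}^{m} w_{ik} W_{ik} \tag{3.17}$$
Row $i$ can be any row, and the result is always the same. In the cofactors $W_{ik}$, none
of the matrix elements of the $i$th row appear, so the determinant is a linear function
of these elements. Taking now a partial derivative of (3.17) with respect to one of the
elements, say, $w_{ij}$, gives
$$\frac{\partial \det \mathbf{W}}{\partial w_{ij}} = W_{ij}$$
By definitions (3.9) and (3.16), this implies directly that
$$\frac{\partial \det \mathbf{W}}{\partial \mathbf{W}} = \operatorname{adj}(\mathbf{W})^T$$
But $\operatorname{adj}(\mathbf{W})^T$ is equal to $(\det \mathbf{W})(\mathbf{W}^T)^{-1}$ by (3.15), so we have shown our result
(3.14).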
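This also implies that
$$\frac{\partial \log |\det \mathbf{W}|}{\partial \mathbf{W}} = \frac{1}{|\det \mathbf{W}|} \frac{\partial |\det \mathbf{W}|}{\partial \mathbf{W}} = (\mathbf{W}^T)^{-1} \tag{3.18}$$
see (3.15). This is an example of the matrix gradient of a composite function
consisting of the $\log$, absolute value, and $\det$ functions. This result will be needed
when the ICA problem is solved by maximum likelihood estimation in Chapter 9.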
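Result (3.18) lends itself to a quick numerical check. The sketch below assumes NumPy and uses `np.linalg.slogdet`, which returns the sign and the logarithm of the absolute determinant; the test matrix and its diagonal shift are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 4
# Hypothetical test matrix; the diagonal shift makes a near-singular draw unlikely.
W = rng.standard_normal((m, m)) + m * np.eye(m)

def log_abs_det(W):
    # slogdet returns (sign, log|det W|); only the second value is needed here.
    _, logabsdet = np.linalg.slogdet(W)
    return logabsdet

# Estimate each element of the matrix gradient (3.9) by central differences.
eps = 1e-6
numeric = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        E = np.zeros((m, m))
        E[i, j] = eps
        numeric[i, j] = (log_abs_det(W + E) - log_abs_det(W - E)) / (2 * eps)

analytic = np.linalg.inv(W).T     # right-hand side of (3.18)
max_abs_diff = np.max(np.abs(numeric - analytic))
```

Working with the logarithm of the determinant, rather than the determinant itself, also tends to avoid overflow for larger matrices.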
3.1.4 Taylor series expansions of multivariate functions
In deriving some of the gradient type learning algorithms, we have to resort to Taylor
series expansions of multivariate functions. In analogy with the well-known Taylor
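series expansion of a function $g(x)$ of a scalar variable $x$,
$$g(x') = g(x) + \frac{dg}{dx}(x' - x) + \frac{1}{2} \frac{d^2 g}{dx^2}(x' - x)^2 + \ldots \tag{3.19}$$
we can do a similar expansion for a function $g(\mathbf{w}) = g(w_1, \ldots, w_m)$ of $m$ variables.
We have
$$g(\mathbf{w}') = g(\mathbf{w}) + \left(\frac{\partial g}{\partial \mathbf{w}}\right)^T (\mathbf{w}' - \mathbf{w}) + \frac{1}{2} (\mathbf{w}' - \mathbf{w})^T \frac{\partial^2 g}{\partial \mathbf{w}^2} (\mathbf{w}' - \mathbf{w}) + \ldots \tag{3.20}$$
where the derivatives are evaluated at the point $\mathbf{w}$. The second term is the inner
product of the gradient vector with the vector $\mathbf{w}' - \mathbf{w}$, and the third term is a quadratic
form with the symmetric Hessian matrix $\frac{\partial^2 g}{\partial \mathbf{w}^2}$. The truncation error depends on the
distance $\|\mathbf{w}' - \mathbf{w}\|$; the distance has to be small, if $g(\mathbf{w}')$ is approximated using only
the first- and second-order terms.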
The same expansion can be made for a scalar function of a matrix variable. The
second order term already becomes complicated because the second order gradient is
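a four-dimensional tensor. But we can easily extend the first order term in (3.20), the
inner product of the gradient with the vector $\mathbf{w}' - \mathbf{w}$, to the matrix case. Remember
that the vector inner product is defined as
$$\left(\frac{\partial g}{\partial \mathbf{w}}\right)^T (\mathbf{w}' - \mathbf{w}) = \sum_{i=1}^{m} \left(\frac{\partial g}{\partial \mathbf{w}}\right)_i (w_i' - w_i)$$
For the matrix case, this must become the sum
$$\sum_{i=1}^{m} \sum_{j=1}^{n} \left(\frac{\partial g}{\partial \mathbf{W}}\right)_{ij} (w_{ij}' - w_{ij})$$
This is the sum of the products of corresponding elements, just like in the vectorial inner
product. This can be nicely presented in matrix form when we remember that for any
two matrices of the same size, say, $\mathbf{A}$ and $\mathbf{B}$,
$$\operatorname{trace}(\mathbf{A}^T \mathbf{B}) = \sum_{i} (\mathbf{A}^T \mathbf{B})_{ii} = \sum_{i} \sum_{j} (\mathbf{A})_{ji} (\mathbf{B})_{ji}$$
with obvious notation. So, we have
$$g(\mathbf{W}') = g(\mathbf{W}) + \operatorname{trace}\left[\left(\frac{\partial g}{\partial \mathbf{W}}\right)^T (\mathbf{W}' - \mathbf{W})\right] + \ldots \tag{3.21}$$
for the first two terms in the Taylor series of a function of a matrix variable.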
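Both the trace identity and the first-order expansion (3.21) can be illustrated in a few lines. This is a sketch assuming NumPy; the test function $g(\mathbf{W}) = \sum_{ij} w_{ij}^2$, whose matrix gradient is $2\mathbf{W}$, is a hypothetical choice made because its neglected second-order term is known exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((3, 5))

# trace(A^T B) is the sum of products of corresponding elements.
lhs = np.trace(A.T @ B)
rhs = np.sum(A * B)

W = rng.standard_normal((3, 5))
D = 1e-3 * rng.standard_normal((3, 5))   # small displacement W' - W

def g(W):
    return np.sum(W ** 2)                # hypothetical test function, gradient 2W

# First two terms of the Taylor series (3.21).
first_order = g(W) + np.trace((2 * W).T @ D)

# For this g the neglected remainder is exactly sum(D**2), up to rounding.
error = g(W + D) - first_order
```

The remainder scales with the square of the displacement, so halving `D` reduces `error` by roughly a factor of four.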
3.2 LEARNING RULES FOR UNCONSTRAINED OPTIMIZATION
3.2.1 Gradient descent
Many of the ICA criteria have the basic form of minimizing a cost function $J(\mathbf{W})$
with respect to a parameter matrix $\mathbf{W}$, or possibly with respect to one of its columns
$\mathbf{w}$. In many cases, there are also constraints that restrict the set of possible solutions.
A typical constraint is to require that the solution vector must have a bounded norm,
or the solution matrix has orthonormal columns.
For the unconstrained problem of minimizing a multivariate function, the most
classic approach is steepest descent or gradient descent. Let us consider in more
detail the case when the solution is a vector $\mathbf{w}$; the matrix case goes through in a
completely analogous fashion.
In gradient descent, we minimize a function $J(\mathbf{w})$ iteratively by starting from
some initial point $\mathbf{w}(0)$, computing the gradient of $J(\mathbf{w})$ at this point, and then
moving in the direction of the negative gradient or the steepest descent by a suitable
distance. Once there, we repeat the same procedure at the new point, and so on. For
$t = 1, 2, \ldots$ we have the update rule
$$\mathbf{w}(t) = \mathbf{w}(t-1) - \alpha(t) \left.\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}\right|_{\mathbf{w} = \mathbf{w}(t-1)} \tag{3.22}$$
with the gradient taken at the point $\mathbf{w}(t-1)$. The parameter $\alpha(t)$ gives the length of
the step in the negative gradient direction. It is often called the step size or learning
rate. Iteration (3.22) is continued until it converges, which in practice happens when
the Euclidean distance between two consecutive solutions $\|\mathbf{w}(t) - \mathbf{w}(t-1)\|$ goes
below some small tolerance level.
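The update rule (3.22) and its stopping criterion can be sketched as follows. This is a minimal illustration assuming NumPy and a constant learning rate; the function name `gradient_descent`, the default parameter values, and the quadratic test cost are all hypothetical choices.

```python
import numpy as np

def gradient_descent(grad_J, w0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Iterate w(t) = w(t-1) - alpha * grad_J(w(t-1)), as in rule (3.22),
    with a constant learning rate alpha (an assumption of this sketch)."""
    w = np.asarray(w0, dtype=float)
    for t in range(1, max_iter + 1):
        w_new = w - alpha * grad_J(w)
        # Stop when two consecutive solutions are closer than the tolerance.
        if np.linalg.norm(w_new - w) < tol:
            return w_new, t
        w = w_new
    return w, max_iter

# Hypothetical cost J(w) = 0.5 * w^T A w - b^T w with symmetric positive definite A,
# whose gradient is A w - b and whose minimum solves A w = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
w_min, n_iter = gradient_descent(lambda w: A @ w - b, w0=np.zeros(2))
```

The function returns both the final point and the number of iterations used, so the effect of different values of `alpha` on convergence speed can be compared directly.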
If there is no reason to emphasize the time or iteration step, a convenient shorthand
notation will be used throughout this book in presenting update rules of the preceding
type. Denote the difference between the new and old value by
$$\Delta \mathbf{w} = \mathbf{w}(t) - \mathbf{w}(t-1) \tag{3.23}$$
We can then write the rule (3.22) either as
$$\Delta \mathbf{w} = -\alpha \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$$
or even shorter as
$$\Delta \mathbf{w} \propto -\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$$
The symbol $\propto$ is read “is proportional to”; it is then understood that the vector on the
left-hand side, $\Delta \mathbf{w}$, has the same direction as the negative gradient vector on the right-hand
side, but there is a positive scalar coefficient by which the length can be adjusted. In
the upper version of the update rule, this coefficient is denoted by $\alpha$. In many cases,
this learning rate can and should in fact be time dependent. Yet a third very convenient
way to write such update rules, in conformity with programming languages, is
$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$$
where the symbol $\leftarrow$ means substitution, i.e., the value of the right-hand side is
computed and substituted in $\mathbf{w}$.
Geometrically, a gradient descent step as in (3.22) means going downhill. The
graph of $J(\mathbf{w})$ is the multidimensional equivalent of mountain terrain, and we are
always moving downwards in the steepest direction. This also immediately shows
the disadvantage of steepest descent: unless the function $J(\mathbf{w})$ is very simple and
smooth, steepest descent will lead to the closest local minimum instead of a global
minimum. As such, the method offers no way to escape from a local minimum.
Nonquadratic cost functions may have many local maxima and minima. Therefore,
good initial values are important in initializing the algorithm.
Fig. 3.1 Contour plot of a cost function with a local minimum. (Figure not reproduced; its labels mark the local minimum, the global minimum, and the gradient vector.)
As an example, consider the case of Fig. 3.1. A function $J(\mathbf{w})$ is shown there as
a contour plot. In the region shown in the figure, there is one local minimum and one
global minimum. From the initial point chosen there, where the gradient vector has
been plotted, it is very likely that the algorithm will converge to the local minimum.
Generally, the speed of convergence can be quite low close to the minimum point,
because the gradient approaches zero there. The speed can be analyzed as follows.
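Let us denote by $\mathbf{w}^*$ the local or global minimum point where the algorithm will
eventually converge. From (3.22) we have
$$\mathbf{w}(t) - \mathbf{w}^* = \mathbf{w}(t-1) - \mathbf{w}^* - \alpha(t) \left.\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}\right|_{\mathbf{w} = \mathbf{w}(t-1)} \tag{3.24}$$
Let us expand the gradient vector $\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$ element by element as a Taylor series around
the point $\mathbf{w}^*$, as explained in Section 3.1.4. Using only the zeroth- and first-order
terms, we have for the $i$th element
$$\left.\frac{\partial J(\mathbf{w})}{\partial w_i}\right|_{\mathbf{w} = \mathbf{w}(t-1)} = \left.\frac{\partial J(\mathbf{w})}{\partial w_i}\right|_{\mathbf{w} = \mathbf{w}^*} + \sum_{j=1}^{m} \left.\frac{\partial^2 J(\mathbf{w})}{\partial w_i \partial w_j}\right|_{\mathbf{w} = \mathbf{w}^*} [w_j(t-1) - w_j^*] + \ldots$$
Now, because $\mathbf{w}^*$ is the point of convergence, the partial derivatives of the cost
function must be zero at $\mathbf{w}^*$. Using this result, and compiling the above expansion
into vector form, yields
$$\left.\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}\right|_{\mathbf{w} = \mathbf{w}(t-1)} = \mathbf{H}(\mathbf{w}^*) [\mathbf{w}(t-1) - \mathbf{w}^*] + \ldots$$
where $\mathbf{H}(\mathbf{w}^*)$ is the Hessian matrix computed at the point $\mathbf{w} = \mathbf{w}^*$. Substituting this
in (3.24) gives
$$\mathbf{w}(t) - \mathbf{w}^* \approx [\mathbf{I} - \alpha(t) \mathbf{H}(\mathbf{w}^*)] [\mathbf{w}(t-1) - \mathbf{w}^*]$$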
This kind of convergence, which is essentially equivalent to multiplying a matrix
many times with itself, is called linear. The speed of convergence depends on the
learning rate $\alpha(t)$ and the size of the Hessian matrix $\mathbf{H}(\mathbf{w}^*)$. If the cost function $J(\mathbf{w})$ is very
flat at the minimum, with second partial derivatives also small, then the Hessian is
small and the convergence is slow (for fixed $\alpha(t)$). Usually, we cannot influence the
shape of the cost function, and we have to choose $\alpha(t)$, given a fixed cost function.
The choice of an appropriate step length or learning rate is thus essential: too
small a value will lead to slow convergence. The value cannot be too large either: too
large a value will lead to overshooting and instability, which prevents convergence
altogether. In Fig. 3.1, too large a learning rate will cause the solution point to zigzag
around the local minimum. The problem is that we do not know the Hessian matrix
and therefore determining a good value for the learning rate is difficult.
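The role of the learning rate in the linear convergence analysis can be made concrete on a quadratic cost, where the error obeys the recursion with the matrix $\mathbf{I} - \alpha \mathbf{H}$ exactly. The sketch below assumes NumPy; the diagonal Hessian, the starting point, and the three learning rates are hypothetical values picked to show slow, fast, and unstable behavior.

```python
import numpy as np

H = np.diag([1.0, 10.0])          # Hessian of the hypothetical cost J(w) = 0.5 * w^T H w
w_star = np.zeros(2)              # its minimum point

def run(alpha, steps=50):
    """Return the error norms ||w(t) - w*|| for gradient descent with fixed alpha."""
    w = np.array([1.0, 1.0])
    errors = [np.linalg.norm(w - w_star)]
    for _ in range(steps):
        w = w - alpha * (H @ w)   # the gradient of J is H w
        errors.append(np.linalg.norm(w - w_star))
    return np.array(errors)

def contraction_factor(alpha):
    """Largest absolute eigenvalue of I - alpha * H."""
    return np.max(np.abs(np.linalg.eigvals(np.eye(2) - alpha * H)))

slow = run(0.01)       # factor 0.99: stable but slow
good = run(0.18)       # factor 0.82: much faster
unstable = run(0.21)   # factor 1.1: overshoots and diverges
```

In this example the iteration is stable only while the largest absolute eigenvalue of $\mathbf{I} - \alpha \mathbf{H}$ stays below one, that is, for $\alpha < 2/10$; in practice the Hessian is unknown, which is the difficulty the text points out.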
A simple extension to the basic gradient descent, popular in neural network learning
rules like the back-propagation algorithm, is to use a two-step iteration instead
of just one step like in (3.22), leading to the so-called momentum method. Neural
network literature has produced a large number of tricks for boosting steepest descent
learning by adjustable learning rates, clever choice of the initial value, etc. However,
in ICA, many of the most popular algorithms are still straightforward gradient descent
methods, in which the gradient of an appropriate contrast function is computed and
used as such in the algorithm.
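The text does not spell out the momentum update, so the following sketch assumes one common form, in which the new step is the negative gradient step plus a fraction `beta` of the previous step. The function name, the parameter values, and the quadratic test cost are hypothetical, and NumPy is assumed.

```python
import numpy as np

def momentum_descent(grad_J, w0, alpha=0.05, beta=0.8, n_steps=200):
    """Gradient descent with a momentum term (assumed form):
    delta(t) = -alpha * grad_J(w(t-1)) + beta * delta(t-1)."""
    w = np.asarray(w0, dtype=float)
    delta = np.zeros_like(w)
    for _ in range(n_steps):
        delta = -alpha * grad_J(w) + beta * delta
        w = w + delta
    return w

# Hypothetical ill-conditioned quadratic cost J(w) = 0.5 * w^T H w, minimum at the origin.
H = np.diag([1.0, 10.0])
grad_J = lambda w: H @ w

w_plain = momentum_descent(grad_J, [1.0, 1.0], beta=0.0)   # plain gradient descent
w_mom = momentum_descent(grad_J, [1.0, 1.0], beta=0.8)     # with momentum
```

Setting `beta=0.0` recovers rule (3.22). On this particular cost the momentum version ends up closer to the minimum after the same number of steps, but that is a property of the example and of the chosen parameters, not a general guarantee.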
3.2.2 Second-order learning
In numerical analysis, a large number of methods that are more efficient than plain
gradient descent have been introduced for minimizing or maximizing a multivariate
scalar function. They could be immediately used for the ICA problem. Their advantage
is faster convergence in terms of the number of iterations required, but the
disadvantage quite often is increased computational complexity per iteration. Here
we consider second-order methods, which means that we also use the information
contained in the second-order derivatives of the cost function. Obviously, this information
relates to the curvature of the optimization terrain and should help in finding
a better direction for the next step in the iteration than just plain gradient descent.
A good starting point is the multivariate Taylor series; see Section 3.1.4. Let us
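develop the function $J(\mathbf{w})$ in Taylor series around a point $\mathbf{w}$ as
$$J(\mathbf{w}') = J(\mathbf{w}) + \left[\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}\right]^T (\mathbf{w}' - \mathbf{w}) + \frac{1}{2} (\mathbf{w}' - \mathbf{w})^T \frac{\partial^2 J(\mathbf{w})}{\partial \mathbf{w}^2} (\mathbf{w}' - \mathbf{w}) + \ldots \tag{3.25}$$
In trying to minimize the function $J(\mathbf{w})$, we ask what choice of the new point $\mathbf{w}'$
gives us the largest decrease in the value of $J(\mathbf{w})$. We can write $\mathbf{w}' - \mathbf{w} = \Delta \mathbf{w}$ and
minimize the function
$$J(\mathbf{w}') - J(\mathbf{w}) = \left[\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}\right]^T \Delta \mathbf{w} + \frac{1}{2} \Delta \mathbf{w}^T \frac{\partial^2 J(\mathbf{w})}{\partial \mathbf{w}^2} \Delta \mathbf{w}$$
with respect to $\Delta \mathbf{w}$. The gradient of this function with respect to $\Delta \mathbf{w}$ is (see Example
3.2) equal to $\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} + \frac{\partial^2 J(\mathbf{w})}{\partial \mathbf{w}^2} \Delta \mathbf{w}$; note that the Hessian matrix is symmetric. If the
Hessian is also positive definite, then the function will have a parabolic shape and the
minimum is given by the zero of the gradient. Setting the gradient to zero gives
$$\Delta \mathbf{w} = -\left[\frac{\partial^2 J(\mathbf{w})}{\partial \mathbf{w}^2}\right]^{-1} \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$$
From this, the following second-order iteration rule emerges:
$$\mathbf{w}' = \mathbf{w} - \left[\frac{\partial^2 J(\mathbf{w})}{\partial \mathbf{w}^2}\right]^{-1} \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} \tag{3.26}$$
where we have to compute the gradient and Hessian on the right-hand side at the
point $\mathbf{w}$.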
Algorithm (3.26) is called Newton’s method, and it is one of the most efficient
ways for function minimization. It is, in fact, a special case of the well-known
Newton’s method for solving an equation; here it solves the equation that says that
the gradient is zero.
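A minimal sketch of iteration (3.26) follows, assuming NumPy. The function name `newton_minimize`, the stopping rule based on the step length, and the convex test cost are hypothetical choices; the linear system is solved directly instead of forming the inverse Hessian, which is a common numerical practice rather than something prescribed by the text.

```python
import numpy as np

def newton_minimize(grad_J, hess_J, w0, tol=1e-10, max_iter=50):
    """Newton iteration (3.26): w' = w - H(w)^{-1} grad(w)."""
    w = np.asarray(w0, dtype=float)
    for t in range(1, max_iter + 1):
        # Solve H * step = grad rather than inverting the Hessian explicitly.
        step = np.linalg.solve(hess_J(w), grad_J(w))
        w = w - step
        if np.linalg.norm(step) < tol:
            return w, t
    return w, max_iter

# Hypothetical convex cost: J(w) = sum(exp(w_i)) + 0.5 * ||w - c||^2
c = np.array([2.0, -2.0])
grad_J = lambda w: np.exp(w) + (w - c)
hess_J = lambda w: np.diag(np.exp(w)) + np.eye(2)   # positive definite everywhere

w_min, n_iter = newton_minimize(grad_J, hess_J, w0=np.zeros(2))
```

The Hessian of this test cost is positive definite at every point, so the iteration is well behaved from the chosen start; for general cost functions that is not guaranteed.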
Newton’s method provides a fast convergence in the vicinity of the minimum,
if the Hessian matrix is positive definite there, but the method may perform poorly
farther away. A complete convergence analysis is given in [284]. It is also shown
there that the convergence of Newton’s method is quadratic; if $\mathbf{w}^*$ is the limit of
convergence, then
$$\|\mathbf{w}' - \mathbf{w}^*\| \le c \|\mathbf{w} - \mathbf{w}^*\|^2$$
where $c$ is a constant. This is a very strong mode of convergence. When the error on
the right-hand side is relatively small, its square can be orders of magnitude smaller.
(If the exponent is 3, the convergence is called cubic, which is somewhat better than
quadratic, although the difference is not as large as the difference between linear and
quadratic convergence.)
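Quadratic convergence can be observed on a scalar example by tracking the error over a few Newton steps. The cost $J(w) = e^w - 2w$, with minimum at $w^* = \ln 2$, and the starting point are hypothetical choices; NumPy is assumed.

```python
import numpy as np

w_star = np.log(2.0)              # minimizer of the hypothetical cost J(w) = exp(w) - 2w
w = 1.0
errors = [abs(w - w_star)]
for _ in range(4):
    grad = np.exp(w) - 2.0        # J'(w)
    hess = np.exp(w)              # J''(w)
    w = w - grad / hess           # scalar version of rule (3.26)
    errors.append(abs(w - w_star))

# The ratios |w' - w*| / |w - w*|^2 stay bounded, which is the defining
# property of quadratic convergence.
ratios = [errors[t + 1] / errors[t] ** 2 for t in range(3)]
```

In this example each step roughly squares the error, so the number of correct digits about doubles per iteration until rounding error takes over.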
On the other hand, Newton’s method is computationally much more demanding
per iteration than the steepest descent method. The inverse of the Hessian
has to be computed at each step, which is prohibitively heavy for many practical
cost functions in high dimensions. It may also happen that the Hessian matrix
becomes ill-conditioned or close to a singular matrix at some step of the algorithm,
which induces numerical errors into the iteration. One possible remedy for this is
[...]
[...] encountered later in this book. Several learning algorithms of principal component analysis networks and the well-known least-mean-squares algorithm, for example, are instantaneous stochastic gradient algorithms.
Example 3.5 We assume that $\mathbf{x}$ satisfies the ICA model $\mathbf{x} = \mathbf{A}\mathbf{s}$, with the elements of the source vector $\mathbf{s}$ statistically independent and the mixing matrix [...] The problem is to solve [...] and [...], knowing [...]
[...] because the randomness causes fluctuations that never die out unless they are deliberately frozen by letting the learning rate go to zero. The analysis of stochastic algorithms like (3.32) is the subject of stochastic approximation; see, e.g., [253]. In brief, the analysis is based on the averaged differential equation that is obtained from (3.32) by taking averages over [...] on the right-hand side: the differential [...]
[...] points, i.e., roots of the right-hand side, because these are the points where the change in [...] over time becomes zero. It is also well known how, by linearizing the right-hand side with respect to [...], a stability analysis of these fixed points can be accomplished. Especially important are the so-called asymptotically stable fixed points that are local points of attraction. Now, if the learning rate $\alpha(t)$ is a suitably [...]
[...] are the corresponding eigenvalues. The principal components of a random vector $\mathbf{x}$ are defined in terms of the eigenvectors, as discussed in Chapter 6. With a somewhat deeper analysis, it can be shown [324] that the only asymptotically stable fixed point is the eigenvector corresponding to the largest eigenvalue, which gives the first principal component. The example shows how an intractable stochastic on-line rule can be nicely analyzed by the powerful analysis tools existing for ODEs.
3.3 LEARNING RULES FOR CONSTRAINED OPTIMIZATION
In many cases we have to minimize or maximize a function $J(\mathbf{w})$ under some [...]
[...] covered in [172]. Constrained optimization has been extensively discussed in [284]. Projection on the unit sphere and the short-cut approximation for normalization have been discussed in [323, 324]. A rigorous analysis of the convergence of the stochastic on-line algorithms is discussed in [253].
Problems
3.1 Show that the Jacobian matrix of the gradient vector $\frac{\partial g}{\partial \mathbf{w}}$ with respect to $\mathbf{w}$ is equal to the Hessian of $g$ [...]
[...] function. Formulate (a) the corresponding batch learning rule, (b) the averaged differential equation. Consider a stationary point of (a) and (b). Show that if $\mathbf{W}$ is such that the elements of [...] are zero-mean and independent, then $\mathbf{W}$ is a stationary point.
3.9 Assume that we want to maximize a function $F(\mathbf{w})$ on the unit sphere, i.e., under the constraint $\|\mathbf{w}\| = 1$. Prove that at the maximum, the gradient of $F$ must point [...]