Stable Adaptive Control and Estimation for Nonlinear Systems: Neural and Fuzzy Approximator Techniques
Jeffrey T. Spooner, Manfredi Maggiore, Raúl Ordóñez, Kevin M. Passino
Copyright 2002 John Wiley & Sons, Inc.
ISBNs: 0-471-41546-4 (Hardback); 0-471-22113-9 (Electronic)

Chapter 4
Optimization for Training Approximators

4.1 Overview

As humans, we are intuitively familiar with the process of optimization because of our constant exposure to it. For instance, in business investments we seek to maximize our profits; in recreational games we seek to maximize our own score or minimize that of our opponent. It is not surprising that optimization plays a key role in engineering and many other fields. In circuit design we may want to maximize power transfer, in motor design we may want to design for the highest possible torque delivery for a given amount of current, or in communication system design we may want to minimize the probability of error in signal transmission. Indeed, in the design of control systems we have the field of "optimal control," where one objective might be to minimize tracking error and control effort (energy) while stabilizing a system.

Here, as in many adaptive control methods, the adaptive schemes are designed to search for a parameter set which minimizes a cost function, while maintaining, or seeking to achieve, certain closed-loop properties (e.g., stability) of the adaptive system. For instance, we may seek to adjust the parameters of a neural network or fuzzy system (which we treat as "approximators") so that the approximator nonlinearity matches that of the plant, and then this synthesized nonlinearity is used to specify a controller that reduces the tracking error. Optimization then forms a fundamental foundation on which all the approaches rest. It is for this reason that we provide an introduction to optimization here. The reader who is already familiar with optimization methods can skip (or skim) this chapter
and go to the next one.

4.2 Problem Formulation

Consider the minimization of the cost function $J(\theta) > 0$, where $\theta \in S \subseteq R^p$ is a vector of $p$ adjustable parameters. In other words, we wish to find some $\theta^* \in S$ such that

$$\theta^* = \arg\min_{\theta \in S} J(\theta). \tag{4.1}$$

This type of optimization problem is referred to as "constrained optimization" since we require that $\theta \in S$. When $S = R^p$, the minimization problem becomes "unconstrained."

If we wish to find a parameter set that shapes a function $F(x,\theta)$ (that represents a neural network or fuzzy system with tunable parameters $\theta$) so that $F(x,\theta)$ and $f(x)$ match at $x = \bar{x}$, then one might try to minimize the cost function

$$J(\theta) = |f(\bar{x}) - F(\bar{x},\theta)|^2 \tag{4.2}$$

by adjusting $\theta$. If we wish to cause $F(x,\theta)$ to match $f(x)$ on the region $x \in S_x$, then minimization of

$$J(\theta) = \sup_{x \in S_x} |f(x) - F(x,\theta)|^2 \tag{4.3}$$

would be one possible cost function to consider. Minimizing the difference between a known parameterized function (an "approximator") $F(x,\theta)$ and another function $f(x)$ which is in general only partially known is referred to as "function approximation." This special optimization problem will be of particular interest to us throughout the study of adaptive systems using fuzzy systems and neural networks.

Practically speaking, however, in our adaptive estimation and control problems we are either only given a finite amount of information in the form of input-output pairs about the unknown function $f(x)$, or we are given such input-output pairs one at a time in a sequence. Suppose that there are $n$ input variables so $x = [x_1, \ldots, x_n]^T \in R^n$. Suppose we present the function $f(x)$ with a variety of input data (specific values of the variable $x$) and collect the outputs of the function. Let the $i$th input vector of data be denoted by $x^i = [x_1^i, \ldots, x_n^i]^T$, where $x^i \in S_x$, and denote the output of the function by $y^i = f(x^i)$. Furthermore, let the "training data set" be denoted by

$$G = \{(x^i, y^i) : i = 1, 2, \ldots, M\}.$$

Given this, a practical cost function to minimize
is given by

$$J(\theta) = \sum_{i=1}^{M} |y^i - F(x^i,\theta)|^2. \tag{4.4}$$

We will study several ways to minimize this cost function, keeping in mind that we would like to be minimizing a function like the one given in (4.3) at the same time (i.e., even though we can only minimize (4.4), we want to obtain accurate function approximation over a whole continuous range of variables $x \in S_x$). Clearly, if $G$ does not contain a sufficient number of samples within $S_x$, it will not be possible to do a good job at lowering the value of (4.3). For instance, you often need some type of "coverage" of the input space $S_x$ by the $x^i$ data (e.g., uniform on a grid with a small distance between the points). The problem is, however, that in practice you often do not have a choice of how to distribute the data over $S_x$; often you are forced to use a given $G$ directly as it is given to you (and you cannot change it to improve approximation accuracy). We see that due to issues with how $G$ is constructed, the problem of function approximation, specifically the minimization of the magnitude of the "approximation error"

$$e(x) = f(x) - F(x,\theta),$$

is indeed a difficult problem.

What do function approximation and optimization have to do with adaptive control?
As a simple example, suppose we wish to drive the output of the system defined by

$$\dot{x} = f(x) + u \tag{4.5}$$

to zero, where $f(x)$ is a smooth but unknown function, $x$ is the scalar system output, and $u$ is the input. Consider the function $F(x,\theta^*)$, which approximates the unknown function $f(x)$, where $\theta^*$ is a vector of ideal parameters for the cost function (4.3). Then the controller defined by

$$u = -F(x,\theta^*) - kx, \tag{4.6}$$

where $k > 0$, would drive $|x| \to 0$ assuming that $|F(x,\theta^*) - f(x)| = 0$. This happens because the closed-loop dynamics become $\dot{x} = -kx$, which is an exponentially stable system. In general, it will be our task to find $\theta = \theta^*$ so that the approximator $F(x,\theta) = F(x,\theta^*) \approx f(x)$.

Notice that even in this simple problem, some key issues with trying to find $\theta = \theta^*$ are present when the cost function (4.4) is used in place of (4.3). For instance, to generate the training data set $G$ we need to assume that we know (can measure) $x$. However, even though we know $u$, we do not necessarily know $f(x)$ unless we can also assume that we know $\dot{x}$ (which can be difficult to measure due to noise). If we know $\dot{x}$, then we can let

$$f(x) = \dot{x} - u.$$

Even though under these assumptions we can gather data pairs $(x^i, f(x^i))$ to form $G$ for training the approximator, we cannot in general pick the $x^i$ unless we can repeatedly initialize the differential equation in (4.5) with every possible $x^i \in S_x$. Since in many practical situations there is only one initial condition (or a finite number of them) that we can pick, the only data we can gather are constrained by the solution of (4.5). This presents us with the following problem: How can we pick the $u(t)$ so that the data we can collect to put in $G$ will ensure good function approximation?
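To make the role of the controller (4.6) concrete, the following is a minimal simulation sketch, not a method from the text. It assumes a made-up plant nonlinearity `f`, a gain `k = 2`, and simple forward-Euler integration; the function name `simulate` and all numerical values are hypothetical. The first run uses an approximator that matches `f` exactly, and the second uses one that is deliberately off by a constant 0.3.

```python
import math

def simulate(f, F, k=2.0, x0=1.0, dt=0.001, t_end=5.0):
    """Forward-Euler integration of x' = f(x) + u with u = -F(x) - k*x.

    Returns |x| at the final time. f plays the role of the unknown plant
    nonlinearity and F the role of the trained approximator F(x, theta*).
    """
    x = x0
    for _ in range(int(t_end / dt)):
        u = -F(x) - k * x          # controller (4.6)
        x += dt * (f(x) + u)       # plant (4.5), one Euler step
    return abs(x)

# Hypothetical "unknown" nonlinearity, chosen only for illustration.
f = lambda x: math.sin(x) + 0.5 * x

# Case 1: perfect approximation, so the closed loop is x' = -k*x.
perfect = simulate(f, F=f)

# Case 2: approximator with a constant error of 0.3, so x' = 0.3 - k*x.
biased = simulate(f, F=lambda x: f(x) - 0.3)
```

With a perfect approximator the state decays essentially to zero, while the constant mismatch leaves the state near 0.3 / k = 0.15 instead of at the origin, which illustrates why approximation error affects closed-loop performance.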
Note that the "controllability" of the uncertain nonlinear system in (4.5) will impact our ability to "steer" the state to regions in $S_x$ where we need to improve approximation accuracy. Also, the connection between $G$ and approximation accuracy depends critically on what optimization algorithm is used to construct $\theta$, as well as on the approximator's structural potential to match the unknown function $f(x)$.

We see that even for our simple scalar problem, a guarantee of approximation accuracy is difficult to provide. Often, the central focus is on showing that even if perfect approximation is not achieved, we still get a stable closed-loop system. In fact, for our example, even if $F(x,\theta)$ does not exactly match $f(x)$, the resulting closed-loop system dynamics are stable and converge to a ball around zero, assuming that a $\theta$ can be found such that $|f(x) - F(x,\theta)| \le D$ for all $x$, where $D > 0$ (see Exercise 4.7).

The main focus of this chapter is to provide optimization algorithms for constructing $\theta$ so that $F(x,\theta)$ approximates $f(x)$. As the above example illustrates, the end approximation accuracy will not be paramount. We simply need to show that if we use the optimization methods shown in this chapter to adjust the approximator, then the resulting closed-loop system will be stable. The size of the approximator error, however, will typically affect the performance of the closed-loop system.

4.3 Linear Least Squares

We will first concentrate on solving the least squares problem for the case where

$$J(\theta) = \sum_{i=1}^{M} w_i\,|f(x^i) - F(x^i,\theta)|^2, \tag{4.7}$$

where $w_i > 0$ are some scalars, $f(x)$ is an unknown function, and $F(x,\theta)$ is an approximator defined by

$$F(x,\theta) = \theta^T \zeta(x), \tag{4.8}$$

so that $\theta$ appears linearly (i.e., a "linear in the parameters" approximator, which we will sometimes call a linear approximator). In this chapter, we will use the shorthand $\zeta^T = \partial F / \partial \theta$.
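As a small illustration of (4.7) and (4.8), the sketch below evaluates a linear-in-the-parameters approximator and its weighted least squares cost. The regressor `zeta` (three Gaussian bumps with fixed centers and width) is an assumption made only for this example; the text does not prescribe a particular $\zeta(x)$, and the function names are hypothetical.

```python
import math

def zeta(x, centers=(-1.0, 0.0, 1.0), width=0.5):
    """Hypothetical regressor vector: Gaussian bumps at fixed centers."""
    return [math.exp(-((x - c) / width) ** 2) for c in centers]

def F(x, theta):
    """Linear-in-the-parameters approximator F(x, theta) = theta^T zeta(x), as in (4.8)."""
    return sum(t * z for t, z in zip(theta, zeta(x)))

def cost(theta, data, weights=None):
    """Weighted least squares cost (4.7) over data pairs (x_i, y_i)."""
    if weights is None:
        weights = [1.0] * len(data)
    return sum(w * (y - F(x, theta)) ** 2 for w, (x, y) in zip(weights, data))

# Data generated by the approximator itself, so the cost at theta_true is zero.
theta_true = [0.5, -1.0, 2.0]
data = [(x, F(x, theta_true)) for x in (-1.5, -0.5, 0.0, 0.7, 1.2)]
J_at_truth = cost(theta_true, data)
J_elsewhere = cost([0.0, 0.0, 0.0], data)
```

Because $\theta$ enters linearly, `F(x, a + b)` equals `F(x, a) + F(x, b)` for any two parameter vectors, which is the property the batch and recursive least squares methods rely on.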
We will later study techniques which consider some $F(x,\theta)$ such that $\theta$ does not necessarily appear linearly in the output of the approximator. Next, we will introduce batch and recursive least squares methods to find $\theta = \theta^*$ which minimizes the cost function (4.7) for input-output data in $G$, assuming the approximator has the form of (4.8).

4.3.1 Batch Least Squares

We will introduce the batch least squares method to train linear approximators by first discussing the solution of the linear system identification problem. Let $f$ denote the physical system that we wish to identify. The training set $G$ is defined by the experimental input-output data that is generated from this system. In linear system identification, we can use a model

$$y(k) = \sum_{i=1}^{q} \theta_{a_i}\, y(k-i) + \sum_{i=0}^{p} \theta_{b_i}\, u(k-i),$$

where $u(k)$ and $y(k)$ are the system input and output at time $k$. This form of a system model is often referred to as an ARMA (AutoRegressive Moving Average) model. In this case the approximator $y(k) = F(x,\theta)$ is defined with

$$\zeta(k) = [y(k-1), \ldots, y(k-q), u(k), \ldots, u(k-p)]^T, \qquad \theta = [\theta_{a_1}, \ldots, \theta_{a_q}, \theta_{b_0}, \ldots, \theta_{b_p}]^T.$$

We have $n = q + p + 1$ so that $\zeta(k)$ and $\theta$ are $n \times 1$ vectors, and often $\zeta(k)$ is called the "regression vector." Let $Y = [y^1, \ldots, y^M]^T$ be the $M \times 1$ vector of output data, and let $\Phi = [\zeta^1, \ldots, \zeta^M]^T$ be an $M \times n$ matrix that consists of the $\zeta^i$ data vectors stacked into a matrix (i.e., the $\zeta^i$ such that $(\zeta^i, y^i) \in G$). Let

$$\epsilon^i = y^i - (\zeta^i)^T \theta$$

be the error in approximating the data pair $(\zeta^i, y^i) \in G$ using $\theta$. Define $E = [\epsilon^1, \epsilon^2, \ldots, \epsilon^M]^T$ so that $E = Y - \Phi\theta$. Now choose

$$J(\theta) = \tfrac{1}{2} E^T E$$

to be a measure of how good the approximation is for all the data, for a given $\theta$, which is (4.7) with $w_i = 1/2$ for $i = 1, \ldots, M$. We want to pick $\theta$ to minimize $J(\theta)$. Notice that $J(\theta)$ is convex in $\theta$, so that a local minimum is a global minimum in this case.

Using basic ideas from calculus, if we take the partial derivative of $J$ with respect to $\theta$ and set it equal to zero, we get an equation for the best estimate (in the least squares sense) of the unknown $\theta$. Another approach to deriving this result is to notice that

$$2J = E^T E = Y^T Y - Y^T \Phi\theta - \theta^T \Phi^T Y + \theta^T \Phi^T \Phi\, \theta.$$

Then, we "complete the square" by assuming that $\Phi^T\Phi$ is invertible and letting

$$2J = Y^T Y - Y^T \Phi\theta - \theta^T \Phi^T Y + \theta^T \Phi^T \Phi\, \theta + Y^T \Phi(\Phi^T\Phi)^{-1}\Phi^T Y - Y^T \Phi(\Phi^T\Phi)^{-1}\Phi^T Y$$

(where we are simply adding and subtracting the same terms at the end of the equation). Hence,

$$2J = Y^T\left(I - \Phi(\Phi^T\Phi)^{-1}\Phi^T\right)Y + \left(\theta - (\Phi^T\Phi)^{-1}\Phi^T Y\right)^T \Phi^T\Phi \left(\theta - (\Phi^T\Phi)^{-1}\Phi^T Y\right).$$

Since the first term in this equation is independent of $\theta$, we cannot reduce $J$ via this term, so it can be ignored. Thus, to get the smallest value of $J$, we choose $\theta$ so that the second term is equal to zero, since its contribution is a non-negative value. We will let $\theta^*$ denote the value of $\theta$ that achieves the minimization of $J$, and we notice that

$$\theta^* = (\Phi^T\Phi)^{-1}\Phi^T Y, \tag{4.9}$$

since the smallest we can make the last term in the above equation is zero. This is the equation for batch least squares, which shows that we can directly compute the least squares estimate $\theta^*$ from the "batch" of data that is loaded into $\Phi$ and $Y$. If we pick the inputs to the system so that it is "sufficiently excited" [135], then we will be guaranteed that $\Phi^T\Phi$ is invertible ($\mathrm{rank}(\Phi) = n$); furthermore, if the data come from a linear plant with known $q$ and $p$, then for sufficiently large $M$ we will achieve perfect estimation of the plant parameters.

In "weighted" batch least squares we use

$$J(\theta) = \tfrac{1}{2} E^T W E, \tag{4.10}$$

where, for example, $W$ is an $M \times M$ diagonal matrix with its diagonal elements $w_i > 0$ for $i = 1, 2, \ldots, M$. These $w_i$ can be used to weight the importance of certain elements of $G$ more than others. For example, we may choose to put less emphasis on older data by choosing $w_1 < w_2 < \cdots < w_M$ when $x^2$ is collected after $x^1$, $x^3$ is collected after $x^2$, and so on. The resulting parameter estimates can be shown to be given by

$$\theta^* = (\Phi^T W \Phi)^{-1}\Phi^T W Y. \tag{4.11}$$

To show this, simply use (4.10) and proceed with the derivation in the same manner as above.

Example 4.1  As a very simple example of how batch least squares can be used, suppose that we would like to identify the coefficients for the system

$$y(k) = \theta_a\, y(k-1) + \theta_b\, u(k), \tag{4.12}$$

where $\zeta(k) = [y(k-1), u(k)]^T$. Suppose that the data that we would like to fit the system to is given by $G = \{(\zeta^1, y^1), (\zeta^2, y^2), (\zeta^3, y^3)\}$ (the numerical entries are illegible in this copy), so that $M = 3$.
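Independently of the data in this example, formulas (4.9) and (4.11) can be sketched numerically as follows. This is an illustrative sketch only: the data, the "true" coefficients 0.5 and 2.0, and the helper names `solve` and `batch_least_squares` are all hypothetical, and the normal equations are solved by plain Gauss-Jordan elimination rather than by a numerically preferable factorization.

```python
def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix [A | b]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))   # pivot row for column c
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [a - factor * pc for a, pc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def batch_least_squares(Phi, Y, w=None):
    """Return theta* = (Phi^T W Phi)^{-1} Phi^T W Y, with W = I when w is None."""
    if w is None:
        w = [1.0] * len(Y)
    n = len(Phi[0])
    A = [[sum(wi * row[i] * row[j] for wi, row in zip(w, Phi)) for j in range(n)]
         for i in range(n)]
    b = [sum(wi * row[i] * yi for wi, row, yi in zip(w, Phi, Y)) for i in range(n)]
    return solve(A, b)

# Hypothetical noise-free data from y(k) = 0.5*y(k-1) + 2.0*u(k).
# Each row of Phi is a regression vector [y(k-1), u(k)].
Phi = [[1.0, 1.0], [2.5, -1.0], [-0.75, 0.5], [0.6, 2.0]]
Y = [0.5 * y_prev + 2.0 * u for y_prev, u in Phi]

theta = batch_least_squares(Phi, Y)                               # unweighted, (4.9)
theta_w = batch_least_squares(Phi, Y, w=[0.1, 0.2, 0.5, 1.0])     # weighted, (4.11)
```

With noise-free data from a linear model, both the unweighted and the weighted estimates recover the generating coefficients; the weights only change the answer when the data cannot be fit exactly.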
We will use (4.9) to compute the parameters for the solution that best fits the data (in the sense that it will minimize the sum of the squared distances between the identified system and the data). To do this we form $\Phi$ and $Y$ from the data in $G$ and compute $\theta^* = (\Phi^T\Phi)^{-1}\Phi^T Y$ (the numerical values of $\Phi$, $Y$, and $\theta^*$ are illegible in this copy). The resulting system best fits the data in the least squares sense. The same general approach works for larger data sets.

4.3.2 Recursive Least Squares

While the batch least squares approach has proven to be very successful for a variety of applications, the fact that by its very nature it is a "batch" method (i.e., all the data are gathered, then processing is done) may present computational problems. For small $M$ we could clearly repeat the batch calculation for increasingly more data as they are gathered, but as $M$ becomes larger the computations become prohibitive due to the fact that the dimensions of $\Phi$ and $Y$ depend on $M$. Next, we derive a recursive version of the batch least squares method that will allow us to update our estimate of $\theta^*$ each time we get a new data pair, without using all the old data in the computation and without having to compute the inverse of $\Phi^T\Phi$.

Since we will be successively increasing the size of $G$, and since we will assume that we increase the size by one each time step, we let a time index $k$
$= M$ and $i$ be such that $0 \le i \le k$. Let the $n \times n$ matrix

$$P(k) = (\Phi^T\Phi)^{-1} = \left(\sum_{i=1}^{k} \zeta^i (\zeta^i)^T\right)^{-1} \tag{4.13}$$

and let $\theta(k-1)$ denote the least squares estimate based on $k-1$ data pairs ($P(k)$ is called the "covariance matrix"). Assume that $\Phi^T\Phi$ is nonsingular for all $k$. We have

$$P^{-1}(k) = \Phi^T\Phi = \sum_{i=1}^{k} \zeta^i (\zeta^i)^T$$

[...] Using the result from (4.15), this gives us

$$\theta(k) = [\ldots] = \theta(k-1) - P(k)\,\zeta^k (\zeta^k)^T \theta(k-1) + [\ldots]$$

[...]

[...] $M$, for which $y^i = f(x^i)$ are to be learned, then we may either adjust the approximator parameters on a single pair $(x^i, y^i)$ at a time (series updating) or based upon the entire collection of data pairs (batch updating). Series updating is accomplished by selecting a pair $(x^i, y^i)$, where $i \in \{1, \ldots, M\}$ is a random integer chosen at each iteration, and then using Euler's first order approximation of (4.26) so that the parameter update is defined by

$$\theta(k+1) = \theta(k) + \eta\, \zeta^i(k)\, e(k), \tag{4.40}$$

where $k$ is the iteration step, $e(k) = y^i - F(x^i, \theta(k))$, and

$$\zeta^i(k)^T = \left.\frac{\partial F(x^i, z)}{\partial z}\right|_{z = \theta(k)}.$$

We have absorbed the length of the sampling interval into the adaptation gain $\eta$. Since a random presentation of the data pairs is used, the value of $\theta$ tends towards a value which minimizes $\sum_{i=1}^{M} (e^i)^2$ on average.

A second approach is to use a discretized version of (4.31) so that all the data pairs are considered at once. An Euler approximation gives the update law

$$\theta(k+1) = \theta(k) + \eta \sum_{i=1}^{M} \zeta^i(k)\, e^i(k). \tag{4.41}$$

This is often referred to as a gradient update law, or batch back propagation in the neural network community.

In the derivation of the continuous gradient-based update laws, the learning rate, $\eta$, was allowed to be any positive number. Using a discrete update law, however, if $\eta$ is made too large, then the parameter error may converge slowly, oscillate about a fixed value, or diverge, all of which one would like to avoid when updating parameters. The following example shows that if $\eta > 0$ is chosen too large, then the discrete gradient-based algorithms may no
longer provide appropriate values for $\theta$.

Example 4.6  Consider the case where the desired approximator output is $y = F(x,\theta^*)$, where $\theta^*$ is a set of ideal parameters which cause the output of the approximator to be $y^1$ whenever the input $x^1$ is presented (considering the case where only a single data pair is to be learned). The output error is

$$e(k) = y^1 - F(x^1, \theta(k)) \tag{4.42}$$
$$\phantom{e(k)} = F(x^1, \theta^*) - F(x^1, \theta(k)), \tag{4.43}$$

where $\theta(k)$ is the current estimate of $\theta^*$. Defining the parameter error as $\tilde\theta(k) = \theta(k) - \theta^*$, a linear representation may be expressed as

$$e(k) = -\zeta^T \tilde\theta(k) + \delta(x^1, \theta, \theta^*). \tag{4.44}$$

Here $\delta$, with $|\delta(x^1, \theta, \theta^*)| \le L|\tilde\theta|^2$ and $L > 0$ a finite constant, is the error in representing $e(k)$ by a linear expression (more details on the derivation of $\delta$ will be provided in Chapter 5). Here we will assume that we initialize $\theta$ such that $|\tilde\theta|$ is small and thus $|\delta(x^1, \theta, \theta^*)| \approx 0$.

To show that the learning rate, $\eta$, needs to be limited for the discrete case, consider the parameter error metric $V(k) = \tilde\theta^T(k)\tilde\theta(k)$ (if $V \to 0$, then $e(k) \to 0$). The change in $V(k)$ during an update of the weights is

$$V(k+1) - V(k) = \tilde\theta^T(k+1)\tilde\theta(k+1) - \tilde\theta^T(k)\tilde\theta(k).$$

Substituting in the update law (4.40),

$$V(k+1) - V(k) = 2\eta\, \tilde\theta^T(k)\zeta(k)e(k) + \eta^2 e^T(k)\zeta^T(k)\zeta(k)e(k),$$

where we have used $\tilde\theta(k+1) = \theta(k+1) - \theta^*$. Since $\delta \approx 0$, we have $\tilde\theta^T(k)\zeta(k) \approx -e^T(k)$ from (4.44), so that

$$V(k+1) - V(k) \approx -\eta\, e^T(k)\left[2I - \eta\,\zeta^T(k)\zeta(k)\right]e(k). \tag{4.45}$$

Thus if $0 < \lambda_{\min}\!\left(2I - \eta\,\zeta^T(k)\zeta(k)\right)$ for all $k$, then $V(k+1) - V(k) \le 0$. As $\eta$ becomes large, however, the boundedness of the approximator output error is no longer guaranteed (that is, the algorithm can become unstable because it will not be the case that $0 < \lambda_{\min}\!\left(2I - \eta\,\zeta^T(k)\zeta(k)\right)$, so that $V(k)$ increases with $k$).

This example shows that one must be careful not to ask for too large an adaptation rate when dealing with discrete updates. We will see later that this is also true for discrete-time adaptive control systems.

4.4.4 Constrained Optimization

So far, we have not placed any restrictions upon the possible values of $\theta^*$. In some situations, however, we may have a priori knowledge about the
feasible values for $\theta^*$. In this case, a constrained optimization approach may be taken, as we discuss next.

[Figure 4.6: Constrained optimization using a projection algorithm.]

If it is known that the ideal parameters $\theta^*$ belong to a convex set $C$, then it is possible to modify the above adaptation routines to ensure that the parameter estimates remain within $C$. We will, in particular, consider the use of a "projection algorithm." Figure 4.6 shows how the projection algorithm works. If the parameters are within $C$, then the trajectory defined by $\dot\theta$ is not changed. If $\theta$ reaches the boundary of $C$ (denoted by $B$), however, then $\dot\theta$ must be modified such that $\theta$ will not leave $C$, and in particular so that it stays on $B$ until it moves toward the interior of $C$. If we are using, for example, an update law defined by $\dot\theta = \eta v(t)$ where $\eta > 0$, then this may be redefined to incorporate the projection as

$$\dot\theta = \begin{cases} \Pr(\eta v) & \text{if } \theta \in B \text{ and } v^T b_\perp > 0 \\ \eta v & \text{otherwise,} \end{cases} \tag{4.46}$$

where $\Pr(x)$ is the projection of $x$ onto the hyperplane tangent to $B$ at $\theta$, and $b_\perp$ is the unit vector perpendicular to the hyperplane pointing outward at $\theta$. In this way, only the component of the update which does not move outside of $C$ is used in the update.

If a cost function (or Lyapunov function in the study of adaptive systems) defined by $V = \tilde\theta^T\tilde\theta$ is used in the stability analysis of the update algorithm with $\tilde\theta = \theta - \theta^*$, then the stability results are unaffected by the projection. This is because if we modify $\dot\theta$ so that $\theta$ does not move outside $C$, then $|\theta - \theta^*|$ is smaller because of the projection, since $\theta^* \in C$.

When $V = \tilde\theta^T P \tilde\theta$ where $P > 0$, the projection algorithm must be changed slightly to ensure that $V$ decreases at least as fast as in the case when no projection is used. First we let $P = P_1^T P_1$ so that $V = \tilde\theta^T P_1^T P_1 \tilde\theta$. Using the change of coordinates $\bar\theta = P_1\tilde\theta$, notice that $V = \bar\theta^T\bar\theta$. A standard projection algorithm may now be used for $\bar\theta$ by also transforming $C$ to $\bar C$ and $B$ to $\bar B$. Since the transformation $\bar\theta = P_1\tilde\theta$ is linear, $\bar C$ will still be convex.

Example 4.7  Suppose that we wish to
design a projection algorithm for the case where $\theta \in C$, with

$$C = \left\{\theta = [\theta_1, \ldots, \theta_p]^T \in R^p : b_i \le \theta_i \le c_i, \text{ for } i = 1, \ldots, p\right\}, \tag{4.47}$$

so that each element of $\theta$ is confined to an interval. To do this, as we update the $\theta_i$, if $b_i \le \theta_i \le c_i$ then you use the update generated by the gradient method. However, if the update law tells you that the parameter $\theta_i$ should go outside the interval, then you place its value on the interval edge. Moreover, if the value of $\theta_i$ lies on either edge of the interval and the update law says the next value of the parameter should be in the interval, then the update law is allowed to place it there. Clearly such a projection law works for both continuous and discrete time gradient update laws, and it is very easy to implement in code.

4.4.5 Line Search and the Conjugate Gradient Method

Control algorithms that use an approximator with parameters that are modified in real time are referred to as adaptive or on-line approximation techniques. The adaptive estimation and control methods presented later in this book use the least squares and gradient methods presented earlier in this chapter, and these will be shown to provide stable operation for on-line estimation and control methods. In this section, we will depart from the main focus of the book to focus on off-line training (optimization) techniques. These methods can be useful for constructing (nonadaptive) estimators, and for a priori training of approximators that will later be used in an on-line fashion (e.g., in indirect adaptive control).

For the off-line training of the approximator $F(x,\theta)$ to match some unknown nonlinear function $f(x)$, we will not be concerned here with how $F(x,\theta)$ will cause some system dynamics to behave; here we are only concerned with adjusting $\theta$ to make $F(x,\theta)$ match $f(x)$ as closely as possible. For example, the Levenberg-Marquardt and Conjugate Gradient optimization methods are popular approaches for neural network and fuzzy system training. Here, we will discuss a line search
method and the Conjugate Gradient method.

Line Search

When the optimization problem is reduced to a single dimension, a number of techniques may be used to efficiently find a minimum along the search dimension. Each of these typically requires that a minimum be bracketed such that, given points $a < b < c$, we have $J(b) < J(a)$ and $J(b) < J(c)$, so that one or more minima exist between $a$ and $c$. Once the minimum has been bracketed, a routine such as the golden section search, which is outlined below, may be used to iteratively find its location:

1. Choose values $a < b < c$ such that $J(b) < J(a)$ and $J(b) < J(c)$. Let $R = 0.38197$ (the "golden ratio").
2. If $|c - b| > |b - a|$, then let $t_1 = b$ and $t_2 = b + R(c - b)$; otherwise let $t_1 = b - R(b - a)$ and $t_2 = b$.
3. If $J(t_2) < J(t_1)$, then let $a = t_1$, $t_1 = t_2$, and $t_2 = t_2 + R(c - t_2)$; otherwise let $c = t_2$, $t_2 = t_1$, and $t_1 = t_1 - R(t_1 - a)$.
4. If $|c - a| > \mathrm{tol}$, go to step 3.
5. If $J(t_1) < J(t_2)$, then return $t_1$; otherwise return $t_2$.

There exist a number of other line minimization routines, such as Brent's algorithm, which may provide improved convergence. See [181] for further discussion.

Example 4.8  Consider the minimization of the function

$$y = (x - 1)^2 + 1, \tag{4.48}$$

which is minimized with $x = 1$. The golden section search may be used in the minimization given some initial bracketing values $a$, $b$, and $c$. Choosing $a = 0$, $b = 0.1$, and $c = 10$, the golden section search is able to minimize (4.48). Figure 4.7 shows the progression of the bracketing values $a$, $t_1$, $t_2$, and $c$. Notice that the algorithm converges to $x = 1$.

[Figure 4.7: Bracketing in the golden section search (bracketing values $a$, $t_1$, $t_2$, and $c$ versus iteration).]

Conjugate Gradient Method

Consider the general minimization of

$$J = \sum_{j=1}^{M} e_j^T e_j. \tag{4.49}$$

If $e_j = y^j - F(x^j, \theta)$, then [equation (4.50) is missing from this copy]. If $\theta(k)$ is a guess of $\theta^*$ (the value which minimizes $J$),
then let (4.51) be the “search direction.” Since d(k) is along the negative gradient, J will decrease as we move along B(lc) + qd(k) where q > is the search length with B(k) and c!(k) held constant In fact, J will continue to decrease until 98 Optimization the gradient in the search direction becomes zero (the minimum when the gradient goes to zero) That is, until occurs (4.52) Once the minimum along d is found, the procedure is repeated with B(lc + 1) = e(k) + 7$(k) until J converges to some desired value or no longer decreases This is called the method of “steepest descent.” Figure 4.8 Staircase updating of the steepest descent optimization rou- tine If a new d(k) is chosen in the negative gradient direction, we see that each search direction is orthogonal to the previous direction since the change in the gradient along the previous direction was exactly zero when we stopped the line search The weights are thus modified in a staircase fashion toward a minimum of J as shown in Figure 4.8 If J(B) consists of long narrow va.lleys, then the steepest descent algorithm causes the minimization to proceed by taking a number of steps, often repeating directions as we move toward a minimum Rather than taking orthogonal steps each time, which are not independent of one another, it is desirable to move in new directions which not redo the minimization which has already been completed This concept is known as moving in “conjugate directions.” To see how this is accomplished, consider the Taylor expansion of our cost function J(6) given by J(o) = J(@o) + (0 - &I)~< + $6 - 60)THo(8 - 0,) + h.o.t., (4.53) where [ = ~J/%JQ,Q, and Ho = Hlo=o, with H = [hij] and hij = d” Jld0idOj is the “Hessian matrix.” If J(B) is quadratic, then it has a global minimum at dJ/i381~,~* = Ignoring the higher order terms (“h.o.t.“) in (4.53), note that 13JT = If we have just moved in the direction direction u, we desire that the direction d and now want to move in the be “conjugate” so that If all 
the search directions for a set of vectors are conjugate, then it is said to be a conjugate set. The conjugate gradient method finds successively conjugate search directions without needing to calculate the Hessian. In particular, the Fletcher-Reeves-Polak-Ribiere conjugate gradient method is given as follows:

1. Calculate $\zeta(k)$.
2. Set the search direction equal to $d(k) = -\zeta(k)$.
3. Find $\theta(k+1)$ which minimizes $J(\theta)$ along $d(k)$ (this is achieved via line minimization).
4. Calculate $\zeta(k+1)$.
5. If $|\theta(k+1) - \theta(k)| < \mathrm{tol}$, then return $\theta(k+1)$.
6. Set $d(k+1) = -\zeta(k+1) + \eta\, d(k)$, where [the expression for $\eta$ is missing from this copy].
7. Set $k = k + 1$ and go to step 3.

Though a number of alternative optimization methods exist, the above algorithm is suggested for general purpose off-line learning of the approximator parameters when the gradients exist.

Example 4.9  Here, we will apply the method of conjugate gradients to find a parameter set to minimize the cost function (4.49). Consider learning the function [equation (4.59) is missing from this copy] using the approximator (a radial basis function neural network) [equation (4.60) is missing from this copy], where $\theta$ are adjustable parameters and $c_i$ and $\sigma$ are assumed to be fixed in this example. Here, $M = 100$ data points were taken from a normal distribution about $x = $ [...]. The Gaussian centers $c_i$ were picked to be evenly spaced between $-2$ and $2$, while $\theta(0) = $ [...].

[Figure 4.9: The output of the approximator (-) and training points (o's), plotted against $x$.]

[Figure 4.10: Value of the cost function during learning when using the conjugate gradient approach (-) and gradient descent (- -), plotted against iteration $k$.]

The golden section search algorithm was used for the line minimization in the conjugate gradient routine. Figure 4.9 shows the $(x^i, y^i)$ data pairs used for training along with $F(x,\theta)$ after the conjugate gradient training. In Figure 4.10 we notice that the conjugate gradient algorithm is able to reduce the cost function much more quickly than the gradient routine defined by (4.41) with $\eta = 0.01$.

4.5 Summary

Upon completion of this chapter, the reader
should have an understanding of:

• Linear least squares techniques (batch and recursive).
• Nonlinear least squares techniques (gradient methods, discrete time and constrained cases).
• Line search and the conjugate gradient method.

The tools provided in this chapter will prove useful when defining the parameters of a fuzzy system or neural network so that some function is approximated. When this approximated function represents some nonlinearity describing the dynamics of a system we wish to control, it may be possible to incorporate the fuzzy system or neural network in the control law to improve closed-loop performance.

It was also shown that it may be possible to use the same approximator structure to represent multiple functions by simply changing the value of some parameters (see Example 4.5). This is an important property of fuzzy systems and neural networks. In general, we will find that fuzzy systems and neural networks are able to approximate any continuous function if enough adjustable parameters are included in their definition. This will be the focus of the next chapter.

4.6 Exercises and Design Problems

Exercise 4.1 (Batch and Recursive Least Squares Derivation)  In this problem you will derive several of the least squares methods that were developed in this chapter. First, using basic ideas from calculus, take the partial of $J$ in (4.7) with respect to $\theta$ and set it equal to zero. From this derive an equation for how to pick $\theta^*$. Compare it to (4.9). (Hint: If $m$ and $b$ are two $n \times 1$ vectors and $A$ is an $n \times n$ symmetric matrix (i.e., $A = A^T$), then $\frac{\partial}{\partial m} b^T m = b$, $\frac{\partial}{\partial m} m^T b = b$, and $\frac{\partial}{\partial m} m^T A m = 2Am$.)
Repeat for the weighted batch least squares approach. Finally, derive (4.21) for the weighted recursive least squares approach.

Exercise 4.2 (Batch Least Squares)  Suppose that for Example 4.1 we use the three training data pairs in the training data set $G$, but add one more. In particular, add the pair $([\ldots], 2.2)$ (the input vector is illegible in this copy) to $G$ (to get $M = 4$). Find $\theta^*$ using the (nonweighted) batch least squares approach. Plot on a three-dimensional plot (with the $z$ axis as $y(k)$, and the other two axes $y(k-1)$ and $u(k)$) the training data for Example 4.1 (the three points) and the resulting least squares fit (it is a plane). Repeat this for the case considered above where we add one data point. Plot the new plane and data points on the same plot and compare. Does the change in slope of the plane from the $M = 3$ to the $M = 4$ case make sense? In what sense?

Exercise 4.3 (Recursive Least Squares)  Suppose that for Example 4.2 we use $b(k) = [\ldots] + 0.2\sin(0.01\pi k)$ (i.e., we halve the frequency of the time-varying parameter). Repeat the example, and in particular focus on finding the highest value for $\lambda$ that will cause the estimate of $b(k)$ to achieve as good tracking of $b(k)$ as in Example 4.2 (i.e., as good as the case for $\lambda = 0.2$). Compare the value you find to the one found in the example. Is it bigger or smaller? Why?
Exercise 4.4 (Flat Gradients in Neural Networks)  Show that if all the weights of a feedforward neural network are initialized to zeros, then $\zeta = 0$. This is undesirable since if $\zeta$ is used within a parameter update routine, then the weights will never change. This is why neural network weights are typically initialized to small random values.

Exercise 4.5 (Gradient Tuning of Neural Networks)
(a) Repeat Example 4.5, but for the case where $f(x) = \sin^2(x)$, and where you use a multilayer perceptron with a single hidden layer as the approximator. Tune the gradient algorithm as necessary to get good approximation. Provide a plot of the function $f(x)$ and the approximator on the same graph to compare the estimation accuracy.
(b) Repeat (a) but use a radial basis function neural network.

Exercise 4.6 (Gradient Training of Fuzzy Systems)
(a) Consider a single-input, single-output (standard) fuzzy system with a total of 20 triangular input and output membership functions. If the rule-base is defined such that input membership function $i$ is associated with the $i$th output membership function, then a total of 20 rules are formed. Assume the input membership functions are held constant while the centers of the output membership functions are allowed to vary. Use the gradient descent routine to minimize the error between the output of the fuzzy system and the function $y = \sin(x)$ over $x \in [-\pi, \pi]$ using a total of 50 random test points selected from $[-\pi, \pi]$.
(b) Repeat (a) but use a Takagi-Sugeno fuzzy system with output functions that are affine.

Exercise 4.7 (Controller Design)  Consider the system defined by

$$\dot{x} = f(x) + u. \tag{4.61}$$

If an approximation to $f(x)$ exists such that $\sup_x |f(x) - F(x,\theta)| \le W$, then show that using the controller $u = -kx - F(x,\theta)$ will ensure that

$$\lim_{t \to \infty} |x| \le \frac{W}{k}. \tag{4.62}$$

Hint: Use the Lyapunov candidate $V = x^2$ to show that $\dot{V} \le -kV + W^2/k$.

Exercise 4.8 (Line Search)  Use the golden section search to find an $x$ that minimizes the following functions:
• $f(x) = x^2 + x + [\ldots]$
• $f(x) = (x^2 + 0.1)\exp(-x^2)$

Plot each $f(x)$ and comment on the ability of the golden section search to find a global minimum.

Exercise 4.9 (Conjugate Gradient Optimization)  Use the conjugate gradient routine to adjust the weights of an MLP so that it reasonably matches the following functions over $x \in [-\pi, \pi]$:

• $f(x) = \sin(x)$
• $f(x) = \cos(x)$
• $f(x) = [\ldots] + \sin^2(x)$

Try the above for various numbers of nodes in the network.

... initialize the RLS algorithm (i.e., choose $\theta(0)$ and $P(0)$). One approach to this is to use $\theta(0) = 0$ and $P(0) = P_0$ where $P_0 = \alpha I$ for some large $\alpha > 0$. This is the choice that is often used in practice ...

... some small constant. It is our hope that if we choose the set $\{x^1, \ldots, x^M\}$ with $x^i \in D$ such that the $x^i$'s are uniformly distributed throughout $D$, and we choose the approximation structure properly, ...