Signal Processing 87 (2007) 2708–2717
www.elsevier.com/locate/sigpro

A novel efficient two-phase algorithm for training interpolation radial basis function networks

Hoang Xuan Huan (a), Dang Thi Thu Hien (a), Huu Tue Huynh (b,c,*)

(a) Faculty of Information Technology, College of Technology, Vietnam National University, Hanoi, Vietnam
(b) Faculty of Electronics and Telecommunications, College of Technology, Vietnam National University, Hanoi, Vietnam
(c) Department of Electrical and Computer Engineering, Laval University, Quebec, Canada

Received 18 October 2006; received in revised form 28 April 2007; accepted May 2007. Available online 16 May 2007.

This work has been financially supported by the College of Technology, Vietnam National University, Hanoi. Some preliminary results of this work were presented at the Vietnamese National Workshop on Some Selected Topics of Information Technology, Hai Phong, 25–27 August 2005.

*Corresponding author. Faculty of Electronics and Telecommunications, College of Technology, Vietnam National University, 144 Xuanthuy, Caugiay, Hanoi, Vietnam. Tel.: +84 754 9271; fax: +84 754 9338.
E-mail addresses: huanhx@vnu.edu.vn (H.X. Huan), dthien2000@yahoo.com (D.T.T. Hien), huynh@gel.ulaval.ca, tuehh@vnu.edu.vn (H.T. Huynh).

Abstract

Interpolation radial basis function (RBF) networks have been widely used in various applications. The output layer weights are usually determined by minimizing the sum-of-squares error or by directly solving the interpolation equations. When the number of interpolation nodes is large, these methods are time consuming, make it difficult to control the balance between the convergence rate and the generality, and make it difficult to reach a high accuracy. In this paper, we propose a two-phase algorithm for training interpolation RBF networks with bell-shaped basis functions. In the first phase, the width parameters of the basis functions are determined by taking into account the tradeoff between the error and the convergence rate. Then, the output layer weights are determined by finding the fixed point of a given contraction transformation. The running time of this new algorithm is relatively short, and the balance between the convergence rate and the generality is easily controlled by adjusting the involved parameters, while the error is made as small as desired. Also, its running time can be further reduced because the proposed algorithm can be parallelized. Finally, its efficiency is illustrated by simulations.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Radial basis functions; Width parameters; Output weights; Contraction transformation; Fixed point

1. Introduction

Radial basis function (RBF) networks, first proposed by Powell [1] and introduced into the neural network literature by Broomhead and Lowe [2], have been widely used in pattern recognition, equalization, clustering, etc. (see [3–6]). In a multivariate interpolation network of a function f, the interpolation function is of the form

  φ(x) = Σ_{k=1}^{M} w_k h(‖x − v^k‖, σ_k) + w_0

such that φ(x^k) = y^k for all k = 1, …, N, where {x^k}_{k=1}^{N} is a set of n-dimensional vectors (called interpolation nodes) and y^k = f(x^k) is the measured value of the function f at the node x^k. The real functions h(‖x − v^k‖, σ_k) are called RBFs with the centers v^k, M (M ≤ N) is the number of RBFs used to approximate f, and w_k and σ_k are unknown parameters to be determined.
Properties of RBFs were studied in [7–9]. The most common kind of RBF is the Gaussian function h(u, σ) = exp(−u²/σ²). In interpolation RBF networks, the centers are the interpolation nodes; in this case, M = N and v^k = x^k for all k.

In network training algorithms, the parameters w_k and σ_k are often determined by minimizing the sum-of-squares error or by directly solving the interpolation equations (see [4,6]). An advantage of interpolation RBF networks, proved by Bianchini et al. [10], is that their sum-of-squares error has no local minima, so that any optimization procedure always gives a unique solution. The most common training algorithm is the gradient descent method. Despite the fact that the training time for an RBF network is shorter than that for a multilayer perceptron (MLP), it is still rather long, and the efficiency of any optimization algorithm depends on the choice of initial values [ref]. On the other hand, it is difficult to obtain small errors, and it is not easy to control the balance between the convergence rate and the generality, which depends on the radial parameters. Consequently, interpolation networks are only used when the number of interpolation nodes is not too large; Looney [5] suggests using this kind of network when the number of interpolation nodes is less than 200. Let us consider an interpolation problem in the R^4 space, with 10 points on each dimension. The total number of nodes is 10^4; even with this relatively high figure, the nature of the interpolation problem is still very sparse. With known methods, it is impossible to handle this situation.

In this paper, we propose a highly efficient two-phase algorithm for training interpolation networks. In the first phase, the radial parameters σ_k are defined by balancing between the convergence rate and the generality. In the second phase, the output weights w_k are determined by computing the fixed point of a given contraction transformation. This algorithm converges quickly and can be parallelized in order to reduce its running time. Furthermore, this method gives a high accuracy. Preliminary results show that the algorithm works well even when the number of interpolation nodes is relatively large, as high as 5000 nodes.

This paper is organized as follows. In Section 2, RBF networks and the usual training methods are briefly introduced. Section 3 is dedicated to the new training algorithm. Simulation results are presented in Section 4. Finally, important features of the algorithm are discussed in the conclusion.

2. Interpolation problems and RBF networks: an overview

In this section, the interpolation problem is stated first; then Gaussian RBFs and interpolation RBF networks are briefly introduced.

2.1. Multivariate interpolation problem and radial basis functions

2.1.1. Multivariate interpolation problem

Consider a multivariate function f: D(⊂ R^n) → R^m and a sample set {x^k, y^k}_{k=1}^{N} (x^k ∈ R^n, y^k ∈ R^m) such that f(x^k) = y^k for k = 1, …, N. Let φ be a function of a known form satisfying the interpolation conditions

  φ(x^i) = y^i, for all i = 1, …, N.   (1)

Eq. (1) helps determine the unknown parameters in φ. The points x^k are called interpolation nodes, and the function φ is called the interpolation function of f; it is used to approximate f on the domain D. In 1987, Powell proposed to use RBFs as the interpolation function φ. This technique, using Gaussian RBFs, is described in the following; for further details, see [4–6].

2.1.2. Radial basis function technique

Without loss of generality, it is assumed that m is equal to 1. The interpolation function φ has the following form:
w0 , (2) kẳ1 where k 2 jk xị ẳ ekxv k =sk (3) is the kth RBF corresponding to the function qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi Pn hðx À vk ; sk Þ in Section 1, kuk ¼ i¼1 ui is the Euclidean norm of u; the interpolation node vk is the center vector of jk; sk and wk are unknown parameters to be determined For each k, the parameter sk, also called the width of jk, is used ARTICLE IN PRESS H.X Huan et al / Signal Processing 87 (2007) 2708–2717 2710 to control the domain of influence of the RBF jk If x À vk 43sk then jk(x) is almost negligible In the approximation problem, the number of RBFs is much smaller than N and the center vectors are chosen by any convenient methods By inserting the interpolation conditions of (1) into (2), a system of equations is obtained in order to determine the sets fsk g and fwk g: jxi ị ẳ N X wk jk xi ị ỵ w0 ¼ yi ; 8i ¼ 1; ; N k¼1 (4) Taking (3) into account, i.e vk ¼ xk, it gives N X i k 2 wk eÀkx Àx k =sk ¼ yi À w0 ¼ zi ; 8i ¼ 1; ; N k¼1 (5) If the parameters sk are selected, then we consider the N  N matrix F: (6) F ¼ jk;i ịNN with i jk;i ẳ jk xi ị ẳ eÀjjx Àx k jj =s2k Michelli [11] has proved that if the nodes xk are pairwise different, then F is positive-definite, and hence invertible Therefore, for any w0, there always exists unique solutions w1,y,wN for (5) The above technique can then be used to design interpolation RBF neural networks (hereinafter called interpolation RBF networks) 2.2 Interpolation RBF networks An interpolation RBF network which interpolates an n-variable real function f: D(CRn)-Rm is a 3-layer feedforward network It is composed of n nodes of the input layer, represented by the input vector xARn, N hidden neurons, of which the kth output is the value of radial function jk, m output neurons which determine interpolated values of f The hidden layer is also called RBF layer Like other two-phase algorithms, one advantage of this new algorithm is that m neurons of the output layer can be trained separately There are many different ways to train an RBF network Schwenker et al [6] categorize these training methods into one-, two-, and three-phase learning schemes In one-phase training, the widths sk of the RBF layer are set to a predefined real number s, and only the output layer weights wk are adjusted The most common learning scheme is two-phase training, where the two layers of the RBF network are trained separately The width parameters of the RBF layer are determined first, the output layer weights are then trained by a supervised learning rule Three-phase training is only used for approximation RBF networks; after the initialization of the RBF networks utilizing two phase training, the whole architecture is adjusted through a further optimization procedure The output layer may be determined directly by solving (4) However, when the number of interpolation nodes reaches hundreds, these methods are unstable Usually, in a training algorithm, the output weights are determined by minimizing the sum-of-squares error, which is dened as Eẳ N X jxk ị yk (7) k¼1 Since the function E does not have local minima, optimization procedures always give a good solution for (4) In practice, the training time of an RBF network is much shorter than that of an MLP one However, known methods of multivariate minimizing still take rather long running times, and it is difficult to reach a very small error, or to parallelize the algorithm structure Moreover, the width parameters of the RBFs also affect the network quality and the training time [5] 
In the following, a new two-phase training algorithm is proposed. Briefly, in the first phase, the width parameters σ_k of the network are determined by balancing between its approximation generality and its convergence rate; in the second phase, the output weights w_k are iteratively adjusted by finding the corresponding fixed point of a given contraction transformation.

3. Iterative training algorithm

The main idea of the new training algorithm, which is stated in the following basic theorem, is based on a contraction mapping related to the matrix Φ.

3.1. Basic theorem

Let I be the identity matrix, and let W = [w_1 … w_N]^T and Z = [z_1 … z_N]^T be, respectively, the output weight vector and the right-hand side of (5). By setting

  Ψ = I − Φ = [ψ_{k,j}]_{N×N},   (8)

we have

  ψ_{k,j} = 0 if k = j,  and  ψ_{k,j} = −exp(−‖x^j − x^k‖²/σ_k²) if k ≠ j.   (9)

Then, (4) can be expressed as follows:

  W = ΨW + Z.   (10)

As mentioned in Section 2.1, if the width parameters σ_k and w_0 are determined, then (10) always has a unique solution W. First, we set w_0 to the average of all the y^k:

  w_0 = (1/N) Σ_{k=1}^{N} y^k.   (11)

Now, for each k ≤ N, we have the following function q_k with argument σ_k:

  q_k = Σ_{j=1}^{N} |ψ_{k,j}|.   (12)

Theorem 1. The function q_k(σ_k) is increasing. Also, for every positive number q < 1, there exists a σ_k such that q_k is equal to q.

Proof. From (9) and (12), we can easily verify that q_k is an increasing function of σ_k. Moreover, we have

  lim_{σ_k→∞} q_k = N − 1  and  lim_{σ_k→0} q_k = 0.   (13)

Because the function q_k is continuous, for every q ∈ (0,1) there exists a σ_k such that q_k(σ_k) = q. The theorem is proved.

This theorem shows that for each given positive value q < 1, we can find a set of values {σ_k}_{k=1}^{N} such that the solution W* of (10) is the fixed point of the contraction transformation ΨW + Z corresponding to the contraction coefficient q.

3.2. Algorithm description

Given an error ε, a positive number q < 1 and a given 0 < α < 1, the objective of our algorithm is to determine the parameters σ_k and W*. In the first phase, the σ_k are determined such that q_k ≤ q and, if σ_k were replaced by σ_k/α, then q_k > q. Therefore, the norm ‖Ψ‖* = max_{‖u‖*≤1} ‖Ψu‖* of the matrix Ψ induced by the vector norm ‖·‖* defined in Eq. (14) does not exceed q. In the second phase, the solution W* of Eq. (10) is iteratively approached by finding the fixed point of the contraction transformation ΨW + Z. The algorithm is specified in Fig. 1 and described in detail thereafter.

(Fig. 1. Network training procedure.)

3.2.1. Phase 1: Determining the width parameters

The first phase of the algorithm determines the width parameters σ_k such that q_k ≤ q and as close to q as possible; i.e., if we replace σ_k by σ_k/α then q_k > q. Given a positive number α < 1 and an initial width σ_0, which might be chosen equal to 1/(√2 (2N)^{1/n}) as suggested in [5], the algorithm performs the iterative procedure specified in Fig. 2.

3.2.2. Phase 2: Determining the output weights

To determine the solution W* of Eq. (10), the iterative procedure specified in Fig. 3 is executed. For each N-dimensional vector u, we denote by ‖u‖* the following norm:

  ‖u‖* = Σ_{j=1}^{N} |u_j|.   (14)

The end condition of the algorithm can be chosen from one of the following expressions:

  (a) (q/(1 − q)) ‖W^1 − W^0‖* ≤ ε,   (15)

  (b) t ≥ ln(ε(1 − q)/‖Z‖*)/ln q = (ln ε − ln ‖Z‖* + ln(1 − q))/ln q,   (16)

where t is the number of iterations and W^0, W^1 denote two successive approximations of W* produced in phase 2.
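The two phases can be summarized in the following sketch (Python/NumPy). It is written from the description above and from the stated roles of Figs. 2 and 3, not copied from the paper, so details such as the exact update order inside phase 1 are one reading of the procedure; the function name `train_two_phase` and its default arguments are illustrative.

```python
import numpy as np

def q_k(nodes, k, sigma_k):
    """q_k of Eq. (12): sum over j != k of exp(-||x^j - x^k||^2 / sigma_k^2)."""
    d2 = np.sum((nodes - nodes[k]) ** 2, axis=1)
    return np.exp(-d2 / sigma_k ** 2).sum() - 1.0      # subtract the j = k term, which equals 1

def train_two_phase(nodes, y, q=0.8, alpha=0.9, eps=1e-6, sigma0=None):
    N, n = nodes.shape
    if sigma0 is None:
        sigma0 = 1.0 / (np.sqrt(2.0) * (2.0 * N) ** (1.0 / n))   # initial width suggested in [5]
    # Phase 1: for each node, find the largest sigma_k on the grid sigma0 * alpha^m with q_k <= q,
    # so that replacing sigma_k by sigma_k / alpha would give q_k > q.
    sigmas = np.empty(N)
    for k in range(N):
        s = sigma0
        if q_k(nodes, k, s) > q:
            while q_k(nodes, k, s) > q:
                s *= alpha                              # shrink the width
        else:
            while q_k(nodes, k, s / alpha) <= q:
                s /= alpha                              # grow the width
        sigmas[k] = s
    # Phase 2: fixed-point iteration W <- Psi W + Z for Eq. (10), starting from W = 0.
    d2 = np.sum((nodes[:, None, :] - nodes[None, :, :]) ** 2, axis=2)
    Phi = np.exp(-d2 / sigmas[None, :] ** 2)            # entry (i, k): exp(-||x^i - x^k||^2 / sigma_k^2)
    Psi = np.eye(N) - Phi
    w0 = y.mean()                                       # Eq. (11)
    Z = y - w0
    W = np.zeros(N)
    while True:
        W_new = Psi @ W + Z
        if q / (1.0 - q) * np.abs(W_new - W).sum() <= eps:   # end condition (15)
            return W_new, w0, sigmas
        W = W_new

# Illustrative usage on a small node set drawn from the 3-variable test function used later in Section 4.
rng = np.random.default_rng(1)
X = rng.uniform([0, 0, 0], [3, 4, 5], size=(50, 3))
y = X[:, 0] ** 2 * X[:, 1] + np.sin(X[:, 1] + X[:, 2] + 1.0) + 4.0
W, w0, sigmas = train_two_phase(X, y)
resid = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2) / sigmas[None, :] ** 2) @ W + w0 - y
print(np.max(np.abs(resid)))    # interpolation residual at the nodes; small, comparable to eps
```

Because phase 1 enforces q_k ≤ q for every k, the iteration in phase 2 is a contraction and the loop terminates for any q < 1.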
These end conditions are suggested by the convergence property established in the following theorem.

3.3. Convergence property

The following theorem ensures the convergence of the algorithm and allows us to estimate its error.

Theorem 2. The algorithm always ends after a finite number of iterations, and the final error is bounded by

  ‖W^1 − W*‖* ≤ ε.   (17)

Proof. First, from the conclusion of Theorem 1, it can be seen that the first phase of the algorithm always ends after a finite number of steps and q_k ≤ q for every k. On the other hand, the norm ‖Ψ‖* of the matrix Ψ induced by the vector norm ‖·‖* in Eq. (14) is determined by the following equation (see [12, Theorem 2, Subsection 9.6, Chapter I]):

  ‖Ψ‖* = max_{k≤N} {q_k} ≤ q.   (18)

Therefore, phase 2 corresponds to the procedure of finding the fixed point of the contraction transformation Ψu + Z with the contraction coefficient q, with respect to the initial approximations u^0 = 0 and u^1 = Z. It follows that if we perform t iterative steps in phase 2, then W^1 corresponds to the (t+1)th approximate solution u^{t+1} of the fixed point W* of the contraction transformation. Using the fixed-point theorem in Subsection 12.2 of [12], the training error can be bounded by

  ‖W^1 − W*‖* ≤ (q^{t+1}/(1 − q)) ‖u^1 − u^0‖* = (q^{t+1}/(1 − q)) ‖Z‖*.   (19)

It is easy to verify that expression (16) is equivalent to the inequality (q^{t+1}/(1 − q)) ‖Z‖* ≤ ε. Then the statement holds if the end condition (b) is used. On the other hand, applying (19) at t = 0, with W^0 = u^0 and u^1 = W^1, gives

  ‖W^1 − W*‖* ≤ (q/(1 − q)) ‖W^1 − W^0‖*.   (20)

Combining (15) and (20) gives (17). Then the statement holds if the end condition (a) is used. The theorem is proved.

3.4. Complexity of the algorithm

In this section, the complexity of each phase of the algorithm is analyzed.

Phase 1: Besides n and N, the complexity of the first phase depends on the distribution of the interpolation nodes {x^k}_{k=1}^{N} and does not depend on the function f. Depending on the initial choice of σ_0, the current value p of q_k(σ_0) can satisfy either p > q (the width-decreasing branch of Fig. 2) or p < q (the width-increasing branch of Fig. 2). In the former case, for every k ≤ N, let m_k be the number of iterations in this branch, so that q_k > q for σ_k = α^{m_k − 1} σ_0 but q_k ≤ q for σ_k = α^{m_k} σ_0. Therefore,

  m_k ≤ log_α(σ_min/σ_0), where σ_min = min{σ_k} (≤ σ_0).   (21)

In the same manner, if m_k is the number of iterations in the width-increasing branch, then

  m_k ≤ log_α(σ_0/σ_max), where σ_max = max{σ_k} (≥ σ_0).   (22)

Let

  c = max{log_α(σ_min/σ_0), log_α(σ_0/σ_max)};   (23)

then the complexity of phase 1 is O(cnN²).

Phase 2: The number T of iterations in phase 2 depends on the norm ‖Z‖* of the vector Z and on the value q. It follows from (16) and the proof of Theorem 2 that T can be estimated by

  T = ln(ε(1 − q)/‖Z‖*)/ln q = log_q(ε(1 − q)/‖Z‖*).   (24)

Therefore, the complexity of phase 2 is O(TnN²). Hence, the total complexity of this new algorithm is O((T + c)nN²).
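As a rough numerical illustration of (24) (the figures here are illustrative and are not taken from the experiments): with q = 0.8, ‖Z‖* = 10³ and ε = 10⁻⁶,

  T = log_q(ε(1 − q)/‖Z‖*) = ln(10⁻⁶ · 0.2 / 10³)/ln 0.8 ≈ (−22.3)/(−0.223) ≈ 100,

i.e. about one hundred sweeps of cost O(nN²) each. Since T depends on ε only through ln ε, tightening the stopping error changes T very little, which is consistent with the weak dependence of the training time on ε reported later in Table 2.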
4. Simulation study

Simulations for a 3-input RBF network are performed in order to test the training time and the generality of the algorithm. Its efficiency is also compared to that of the gradient algorithm. The network generality is tested by the following procedure: first, some points that do not belong to the set of interpolation nodes are chosen; then, after the network has been trained, the network outputs are compared to the true values of the function at these points in order to estimate the error. Because all norms in a finite-dimensional space are equivalent (see [12, theorem in Section 9.2]), instead of the norm ‖u‖* determined by (14), the norm ‖u‖** = max_{j≤N} |u_j| is used for the end condition (15). Since ‖u‖* ≤ N ‖u‖**, this change does not influence the convergence property of the algorithm.

(Fig. 2. Specification of the first phase of the algorithm.)
(Fig. 3. Specification of the second phase of the algorithm.)

4.1. Test of training time

The training time, which reflects the convergence rate, is examined for several numbers of nodes and for different values of the parameters q, α and ε. The interpolation nodes in the following examples are obtained by approximately scaling each dimension, combining the overall coordinates, and choosing the data among these points. The simulations are run on a computer with the following configuration: Intel Pentium processor, 3.0 GHz, 256 MB DDR RAM. The test results and comments for the function y = x1²·x2 + sin(x2 + x3 + 1) + 4 are presented below.

4.1.1. Testing results

Simulations are done for the function y = x1²·x2 + sin(x2 + x3 + 1) + 4 with x1 ∈ [0,3], x2 ∈ [0,4], x3 ∈ [0,5]. The stopping rule is set for ε = 10⁻⁶. The parameters q and α are set in turn to 0.5 and 0.5; 0.8 and 0.5; 0.8 and 0.7; 0.8 and 0.8, respectively, with the number of nodes varying from 100 to 5000. The simulation results are presented in Table 1. Table 2 shows results for q = α = 0.7 with 2500 nodes and different values of the stopping error ε.

Table 1. Training time (s) for the stopping rule defined by ε = 10⁻⁶.

  Number of nodes | q = 0.5, α = 0.5 | q = 0.8, α = 0.5 | q = 0.8, α = 0.7 | q = 0.8, α = 0.8
  100             | –                | –                | –                | –
  400             | –                | –                | –                | –
  1225            | 30               | 35               | 37               | 45
  2500            | 167              | 170              | 173              | 174
  3600            | 378              | 390              | 530              | 597
  4900            | 602              | 886              | 1000             | 1125

Table 2. Training time (s) for 2500 nodes, q = α = 0.7, and different stopping errors ε.

  ε               | 10⁻⁹ | 10⁻⁸ | 10⁻⁶ | 10⁻⁵
  Training time   | 177  | 175  | 172  | 170

Comments: From these results, it is observed that:

(1) The training time of our algorithm is relatively short (only several minutes for about 3000 nodes). It increases when q or α increases; in other words, the smaller q or α, the shorter the training time. However, the training time is more sensitive to α than to q.

(2) When the stopping error is reduced, the total training time changes only very slightly. This means that the high accuracy required by a given application does not strongly affect the training time.

4.2. Test of generality when q or α is changed sequentially

To avoid unnecessarily long running times, in this part the number of nodes is limited to 400. These nodes, scattered in the domain {x1 ∈ [0,3], x2 ∈ [0,4], x3 ∈ [0,5]}, are generated as described above, and the network is trained for different values of q and α, with the stopping error ε = 10⁻⁶. After the training is completed, the errors at eight randomly chosen points that do not belong to the trained nodes are checked. Test results for the cases where q or α is changed sequentially are presented in Tables 3 and 4.

4.2.1. Test with q = 0.8 and α changed sequentially

Testing results: Experiment results for ε = 10⁻⁶, q = 0.8 and α set in turn to 0.9, 0.8 and 0.6 are presented in Table 3.

Table 3. Checking error at 8 points with q = 0.8, ε = 10⁻⁶ and α set in turn to 0.9 (training time 5 s), 0.8 (4 s) and 0.6 (4 s). Errors are in units of 10⁻⁴.

  Checked point (x1, x2, x3)      | True value | α = 0.9: value, error | α = 0.8: value, error | α = 0.6: value, error
  (2.68412, 2.94652, 3.329423)    | 26.065739  | 26.0679, 21.6         | 26.06879, 30.502      | 26.0691, 33.802
  (2.21042, 1.052145, 0.040721)   | 10.007523  | 10.0024, 51.24        | 10.0144, 68.763       | 10.0146, 71.163
  (2.842314, 2.525423, 0.048435)  | 23.983329  | 24.01001, 266.81      | 24.0201, 367.706      | 24.0251, 417.90
  (2.842315, 3.789123, 3.283235)  | 35.587645  | 35.5818, 58.45        | 35.5799, 77.452       | 35.5963, 86.548
  (2.05235, 3.78235, 1.63321)     | 20.063778  | 20.05203, 117.48      | 20.0803, 165.219      | 20.0812, 174.21
  (2.84202, 3.789241, 3.283023)   | 35.582265  | 35.5986, 163.34       | 35.5621, 201.655      | 35.561, 212.65
  (2.051234, 3.15775, 0.59763)    | 16.287349  | 16.28183, 55.16       | 16.294, 66.505        | 16.295, 78.505
  (2.52621, 3.36832, 0.86412)     | 24.627938  | 24.67451, 465.72      | 24.58628, 416.584     | 24.5798, 481.38
  Average error                   |            | 149.97                | 174.298               | 194.522

Comment: From these results, it can be observed that when α increases, the checked errors decrease quickly. This implies that when α is small, the width parameters σ_k are also small, which degrades the generality of the network. In our experience, it is convenient to set α ∈ [0.7, 0.9]; the concrete choice depends on the balance between the demanded training time and the generality of the network.
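For completeness, here is a minimal sketch (Python/NumPy, not from the paper) of the generality check used throughout this section: fit on a node set, then measure the error at points that were not used as interpolation nodes. To keep the snippet self-contained, the output weights are obtained here by a direct solve of Eq. (5) with a single assumed width σ, which merely stands in for the two-phase training of Section 3; the test function and the domain are those used in this section.

```python
import numpy as np

def test_function(X):
    return X[:, 0] ** 2 * X[:, 1] + np.sin(X[:, 1] + X[:, 2] + 1.0) + 4.0

rng = np.random.default_rng(2)
low, high = np.array([0.0, 0.0, 0.0]), np.array([3.0, 4.0, 5.0])

# Interpolation nodes and their target values.
X_train = rng.uniform(low, high, size=(400, 3))
y_train = test_function(X_train)

# Fit: direct solve of Eq. (5) with an assumed common width (a stand-in for phases 1 and 2).
# The width trades off the conditioning of the system against the smoothness of the interpolant.
sigma = 0.5
d2 = np.sum((X_train[:, None, :] - X_train[None, :, :]) ** 2, axis=2)
w0 = y_train.mean()
w = np.linalg.solve(np.exp(-d2 / sigma ** 2), y_train - w0)

# Generality check: error at randomly chosen points that do not belong to the trained nodes.
X_test = rng.uniform(low, high, size=(8, 3))
d2_test = np.sum((X_test[:, None, :] - X_train[None, :, :]) ** 2, axis=2)
y_pred = np.exp(-d2_test / sigma ** 2) @ w + w0
print(np.abs(y_pred - test_function(X_test)))    # pointwise test errors
```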
4.2.2. Test with α = 0.9 and q changed sequentially

Testing results: The results for ε = 10⁻⁶, α = 0.9 and q set in turn to 0.9, 0.7 and 0.5 are presented in Table 4.

Table 4. Checking error at 8 points with α = 0.9, ε = 10⁻⁶ and q set in turn to 0.9, 0.7 and 0.5. Errors are in units of 10⁻⁴.

  Checked point (x1, x2, x3)      | True value | q = 0.9: value, error | q = 0.7: value, error | q = 0.5: value, error
  (2.68412, 2.94652, 3.32942)     | 26.06573   | 26.0655, 2.22         | 26.0654, 3.12         | 26.0693, 35.46
  (2.21042, 1.052145, 0.04072)    | 10.00752   | 10.0217, 141.79       | 10.0196, 120.33       | 10.0224, 149.06
  (2.842314, 2.525423, 0.04843)   | 23.98332   | 24.0112, 279.17       | 24.0204, 370.87       | 24.0221, 387.53
  (2.842315, 3.789123, 3.28323)   | 35.58764   | 35.5818, 58.03        | 35.5819, 57.27        | 35.5818, 58.08
  (2.05235, 3.78235, 1.63321)     | 20.06377   | 20.1105, 467.62       | 20.1159, 520.95       | 20.1135, 497.7
  (2.84202, 3.7892411, 3.28302)   | 35.58226   | 35.5881, 58.26        | 35.5884, 61.45        | 35.5886, 63.11
  (2.051234, 3.15775, 0.59763)    | 16.28734   | 16.2853, 20.73        | 16.2852, 21.13        | 16.2775, 98.93
  (2.52621, 3.36832, 0.86412)     | 24.62793   | 24.6117, 162.8        | 24.6133, 146.16       | 24.6108, 171.74
  Average error                   |            | 148.83                | 162.67                | 182.71

Comment: These results show that the generality of the network strongly increases when q increases, although the change of q only weakly influences the training time, as mentioned in Section 4.1.

4.3. Comparison with the gradient algorithm

We have performed simulations for the function y = x1²·x2 + sin(x2 + x3 + 1) + 4 with 100 interpolation nodes and x1 ∈ [0,3], x2 ∈ [0,4], x3 ∈ [0,5]. For the gradient algorithm family, it is very difficult to reach a high training accuracy, and it is also difficult to control the generality of the networks. Besides the training time, the accuracy at trained nodes and the error at untrained nodes (generality) obtained by the gradient method and by our algorithm are now compared. The program of the gradient algorithm is written in Matlab 6.5.

4.3.1. Test of accuracy at trained nodes

We randomly choose eight of the 100 interpolation nodes. After training the network by our algorithm with ε = 10⁻⁶, q = 0.8, α = 0.9 (training time: 1 s) and by the gradient algorithm in two cases, with 100 training loops (training time: 1 s) and with 10,000 training loops (training time: 180 s), we check the errors at the chosen nodes to compare the accuracy of the algorithms.
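The gradient baseline used in this comparison minimizes the sum-of-squares error (7). The original comparison program was written in Matlab 6.5; the following NumPy fragment (not the authors' code) sketches the same idea for the output-layer weights only, with the widths held fixed and with an assumed learning rate and loop count.

```python
import numpy as np

def gradient_train_output_weights(Phi, y, n_loops=100, lr=1e-4):
    """Batch gradient descent on E = sum_i (phi(x^i) - y^i)^2 of Eq. (7).

    Phi[i, k] = phi_k(x^i) is the design matrix; the widths are assumed fixed beforehand.
    The learning rate and loop count are illustrative and strongly affect the achievable
    accuracy, which is the practical difficulty reported above for this family of methods.
    """
    N = Phi.shape[1]
    w = np.zeros(N)
    w0 = 0.0
    for _ in range(n_loops):
        residual = Phi @ w + w0 - y            # phi(x^i) - y^i at every node
        w -= lr * 2.0 * Phi.T @ residual       # dE/dw
        w0 -= lr * 2.0 * residual.sum()        # dE/dw0
    return w, w0
```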
Testing results: The experiment results are presented in Table 5.

Table 5. Checking error at trained nodes to compare accuracy. Errors are in units of 10⁻⁴.

  Checked node (x1, x2, x3)        | True value | Gradient, 100 loops (1 s): value, error | Gradient, 10,000 loops (180 s): value, error | New algorithm, ε=10⁻⁶, q=0.8, α=0.9 (1 s): value, error
  (1.666667, 0.000000, 0.000000)   | 4.841471   | 4.4645, 3769.7   | 5.0959, 2544.2   | 4.84146, 0.1
  (0.333333, 0.444444, 1.379573)   | 4.361647   | 3.5933, 7683.4   | 3.6708, 6908.4   | 4.36166, 0.09
  (2.666667, 0.444444, 1.536421)   | 7.320530   | 8.7058, 13852.7  | 7.2647, 558.2    | 7.32052, 0.08
  (0.666667, 1.333333, 0.128552)   | 5.221158   | 4.0646, 11565.5  | 4.9517, 2694.5   | 5.22117, 0.1
  (2.666667, 1.333333, 1.589585)   | 12.77726   | 12.5041, 2731.6  | 12.1965, 5807.6  | 12.7772, 0.7
  (1.666667, 1.777778, 0.088890)   | 9.209746   | 6.6682, 25415.4  | 9.2944, 846.5    | 9.20972, 0.2
  (2.333333, 0.444444, 0.039225)   | 7.415960   | 6.7228, 6931.5   | 7.48, 640.4      | 7.41596, 0.005
  (2.666667, 3.555556, 0.852303)   | 28.51619   | 28.0927, 4234.9  | 29.2798, 7636.0  | 28.5162, 0.09
  Average error                    |            | 9523.1           | 3454.5           | 0.19

Comment: It can be observed that our algorithm is much better than the gradient algorithm in both training time and accuracy. This seems natural, because the gradient algorithm relies on an optimization procedure, and it is known to be difficult to obtain a very high accuracy with any such procedure.

4.3.2. Comparison of generality

We randomly choose eight untrained nodes. After training the network by the two algorithms with the same parameters as in Section 4.3.1, we check the errors at the chosen nodes to compare the generality of the algorithms.

Testing results: Experiment results are presented in Table 6.

Table 6. Checking error at untrained nodes to compare the generality. Errors are in units of 10⁻⁴.

  Checked node (x1, x2, x3)      | True value | Gradient, 100 loops (1 s): value, error | Gradient, 10,000 loops (180 s): value, error | New algorithm, ε=10⁻⁶, q=0.8, α=0.9 (1 s): value, error
  (0.32163, 0.45123, 1.38123)    | 4.350910   | 2.1394, 22115.1  | 3.9309, 4200.1   | 4.32214, 287.7
  (0.67123, 0.8912, 1.4512)      | 4.202069   | 2.8529, 13491.6  | 4.7884, 5863.3   | 4.20115, 9.1
  (1.68125, 1.34121, 0.27423)    | 8.293276   | 6.1078, 21854.7  | 8.3869, 936.2    | 8.30878, 155.0
  (0.34312, 1.78123, 2.56984)    | 3.406823   | 3.2115, 1953.2   | 4.1438, 7369.7   | 3.399, 78.2
  (2.65989, 3.56012, 0.8498)     | 28.42147   | 27.5174, 9040.7  | 29.1648, 7433.2  | 28.429, 75.2
  (1.67013, 2.23123, 0.29423)    | 9.84913    | 8.6415, 12076.3  | 9.5863, 2628.3   | 9.79204, 570.9
  (2.65914, 3.56123, 0.85612)    | 28.41991   | 27.5147, 9052.1  | 29.1634, 7434.8  | 28.419, 9.1
  (1.3163, 0.44925, 1.12987)     | 5.311670   | 3.5188, 17928.7  | 5.3729, 612.2    | 5.28737, 243.0
  Average error                  |            | 13439.0          | 4559.7           | 178.56

Comments: It is well known that in MLP networks, when the training error is small, the overfitting phenomenon might happen [13]. But for RBF networks, the RBFs only have local influence, such that when the data are not noisy, the overfitting phenomenon is not a serious problem. In fact, the simulation results show that this new algorithm offers a very short training time and a very small test error compared to the gradient algorithm.
5. Discussion and conclusion

This paper proposes a simple two-phase algorithm to train interpolation RBF networks. The first phase iteratively determines the width of the Gaussian RBF associated with each node, and each RBF is trained separately from the others. The second phase iteratively computes the output layer weights by using a given contraction mapping. It is shown in this paper that the algorithm always converges; its running time only depends on the initial values of q, α and ε, on the distribution of the interpolation nodes, and on the vector norm of the interpolated function computed at these nodes. Owing to the numerical advantages of contraction transformations, it is easy to obtain very small training errors, and it is also easy to control the balance between the convergence rate and the generality of the network by setting appropriate values of the parameters. One of the most important features of this algorithm is that the output layer weights can be trained independently, so that the whole algorithm can be parallelized. Furthermore, for a large network, the stopping rule based on the norm of N-dimensional vectors can be replaced by the much simpler one defined in Eq. (16) to avoid lengthy computations.

When the number of nodes is very large, a clustering approach can be used to regroup the data into several sets of smaller size. By doing so, the training can be done in parallel for each cluster, which helps to reduce the training time. The obtained networks are called local RBF networks. This approach might be considered as equivalent to the spline method, and it will be presented in a forthcoming paper.

In the case of a very large number N of nodes, and from the point of view of a neural network as an associative memory, another approach can be exploited. In fact, an approximate RBF network can be designed with a number of hidden nodes much smaller than N, based on the following scheme. First, the data set is partitioned into K clusters {C_i}_{i=1}^{K} by using any clustering algorithm (for example, the k-means method). Then the center n^i of the RBF associated with the ith hidden neuron can be chosen to be the mean vector d^i of C_i, or the vector in C_i nearest to d^i. The network is trained by the algorithm with the set of new interpolation nodes {n^i}_{i=1}^{K}; a minimal sketch of this center-selection step is given below. Other choices based on any variation of this approach can be made, depending on the context of the desired applications.

The advantage of RBF networks is their local influence property (see [13,14]); thus the width parameters are generally chosen small (see Section 6.1, pp. 288–289 of [15]); in particular, in [5, Section 7.7, p. 262] it is suggested to choose σ = 0.05 or 0.1, and in [5, Section 3.11, p. 99] it is also suggested to use σ = 1/(2N)^{1/n}, which is very small. Therefore, q_k must be small. In fact, the condition q_k < 1 presents some limitations, but it does not affect the algorithm performance. Our iterative algorithm is based on the principle of contraction mapping; to ensure the contraction property, the value q used in (12) is fundamental to this algorithm, but its choice is rather empirical and does not correspond to any optimality consideration. In RBF networks, determining the optimum width parameters (or q) is still an open problem.
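A minimal sketch of the center-selection step described above (Python/NumPy, not from the paper): a plain k-means (Lloyd) pass partitions the data into K clusters, and each cluster contributes one candidate RBF center, either its mean vector or the data vector of the cluster nearest to that mean. The function name, the iteration count and the option flag are illustrative.

```python
import numpy as np

def choose_rbf_centers(X, K, n_iter=50, use_nearest_data_point=False, seed=0):
    """Partition X into K clusters and return one candidate RBF center per cluster."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iter):
        # Assign every data vector to its nearest current center.
        d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        labels = np.argmin(d2, axis=1)
        # Move each center to the mean vector d_i of its cluster C_i (keep it if the cluster is empty).
        for i in range(K):
            members = X[labels == i]
            if len(members):
                centers[i] = members.mean(axis=0)
    if use_nearest_data_point:
        # Alternative choice: replace each mean d_i by the vector of C_i nearest to it.
        d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        labels = np.argmin(d2, axis=1)
        for i in range(K):
            members = X[labels == i]
            if len(members):
                centers[i] = members[np.argmin(np.sum((members - centers[i]) ** 2, axis=1))]
    return centers    # these become the reduced interpolation node set {n^i}
```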
References

[1] M.J.D. Powell, Radial basis function approximations to polynomials, in: Numerical Analysis 1987 Proceedings, Dundee, UK, 1988, pp. 223–241.
[2] D.S. Broomhead, D. Lowe, Multivariable functional interpolation and adaptive networks, Complex Systems 2 (1988) 321–355.
[3] E. Blanzieri, Theoretical interpretations and applications of radial basis function networks, Technical Report DIT-03023, Informatica e Telecomunicazioni, University of Trento, 2003.
[4] S. Haykin, Neural Networks: A Comprehensive Foundation, second ed., Prentice-Hall, Englewood Cliffs, NJ, 1999.
[5] C.G. Looney, Pattern Recognition Using Neural Networks: Theory and Algorithms for Engineers and Scientists, Oxford University Press, New York, 1997.
[6] F. Schwenker, H.A. Kestler, G. Palm, Three learning phases for radial-basis-function networks, Neural Networks 14 (4–5) (2001) 439–458.
[7] E.J. Hartman, J.D. Keeler, J.M. Kowalski, Layered neural networks with Gaussian hidden units as universal approximations, Neural Computation (2) (1990) 210–215.
[8] J. Park, I.W. Sandberg, Approximation and radial-basis-function networks, Neural Computation (3) (1993) 305–316.
[9] T. Poggio, F. Girosi, Networks for approximating and learning, Proceedings of the IEEE 78 (9) (1990) 1481–1497.
[10] M. Bianchini, P. Frasconi, M. Gori, Learning without local minima in radial basis function networks, IEEE Transactions on Neural Networks 6 (3) (1995) 749–756.
[11] C.A. Micchelli, Interpolation of scattered data: distance matrices and conditionally positive definite functions, Constructive Approximation 2 (1986) 11–22.
[12] L. Collatz, Functional Analysis and Numerical Mathematics, Academic Press, New York, 1966.
[13] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[14] H.X. Huan, D.T.T. Hien, An iterative algorithm for training interpolation RBF networks, in: Proceedings of the Vietnamese National Workshop on Some Selected Topics of Information Technology, Haiphong, Vietnam, 2005, pp. 314–323.
[15] M.H. Hassoun, Fundamentals of Artificial Neural Networks, MIT Press, Cambridge, MA, 1995.