DSpace at VNU: Reinforcement learning-based intelligent tracking control for wheeled mobile robot


Transactions of the Institute of Measurement and Control

Reinforcement learning-based intelligent tracking control for wheeled mobile robot

Nguyen Tan Luy, Nguyen Thien Thanh and Hoang Minh Tri

Published online 10 March 2014

DOI: 10.1177/0142331213509828

The online version of this article can be found at: http://tim.sagepub.com/content/early/2014/03/10/0142331213509828

Published by:

http://www.sagepublications.com

On behalf of:

The Institute of Measurement and Control


Transactions of the Institute of Measurement and Control 1–10

© The Author(s) 2014

Reprints and permissions: sagepub.co.uk/journalsPermissions.nav

DOI: 10.1177/0142331213509828

tim.sagepub.com

Reinforcement learning-based intelligent tracking control for wheeled mobile robot

Nguyen Tan Luy¹, Nguyen Thien Thanh² and Hoang Minh Tri²

Abstract

This paper proposes a new method to design a reinforcement learning-based integrated kinematic and dynamic tracking control algorithm for a non-holonomic wheeled mobile robot without knowledge of the system's drift tracking dynamics. The actor critic structure in the control scheme uses only one neural network to reduce computational cost and storage resources. A novel tuning law for the single neural network is designed to learn an online solution of a tracking Hamilton–Jacobi–Isaacs (HJI) equation. The HJI solution is used to approximate an $H_\infty$ optimal tracking performance index function and an intelligent tracking control law in the case of the worst disturbance. The laws guarantee closed-loop stability in real time. The convergence and stability of the overall system are proved by Lyapunov techniques. The simulation results on a non-linear system and a wheeled mobile robot verify the effectiveness of the proposed controller.

Keywords

Actor critic, Hamilton–Jacobi–Isaacs equation, neural network, wheeled mobile robot

Introduction

An important motion control problem for wheeled mobile robots (WMRs) is trajectory tracking. This problem has been studied extensively in the past few decades. Generally, a variety of control algorithms for the trajectory tracking problem have been developed in the form of adaptive control (Fierro and Lewis, 1998; Marvin et al., 2009; Mohareri et al., 2012), where back-stepping techniques are used. The kinematic controllers are designed using the available models, and the dynamic controllers are designed based on neural networks (NNs). They are considered indirect adaptive controllers. Besides, they do not minimize any long-term performance function and hence are not optimal. $H_\infty$ adaptive control for a WMR based on inverse optimality is proposed in Miyasato (2008), but it is an offline control scheme. A specific characteristic of WMR models is that they can be presented as a non-linear system in strict-feedback form, but until now, to the best knowledge of the authors, methods of tracking control for a WMR using this form have been considered only in adaptive back-stepping (Chwa, 2010) or adaptive feedback linearization schemes (Khoshnam et al., 2011), without any optimality.

In the other direction, thanks to the online adaptive learning abilities of reinforcement learning (RL) methods in optimal control, tracking control methods for WMRs have been studied. The adaptive critic structures in RL are exploited to learn discrete controllers (Lin and Yang, 2008; Zenon and Marcin, 2011) or a continuous controller without disturbance using the learned solution of the Hamilton–Jacobi–Bellman (HJB) equation (Luy, 2012). These controllers not only overcome drawbacks of other methods, such as the need for a fuzzy domain expert or an existing controller to generate training samples for NNs, but also optimize utility functions, in contrast to NN-based adaptive controllers, which consider only the tracking error at the current time instant. However, these methods require a known explicit model of the WMR and ignore the disturbance, so they are not robust adaptive control methods.

To control a non-linear system such as a WMR with optimality in the presence of disturbances using RL, the solution of the Hamilton–Jacobi–Isaacs (HJI) equation in the $H_\infty$ optimal control problem must be learned (Dierks and Jagannathan, 2010). An integral RL-based direct adaptive control algorithm for a class of general non-linear systems has been studied in Vamvoudakis et al. (2011) to solve the HJI equation. The most favourable part of this algorithm is that the NNs can be trained synchronously to approximate the optimal control input and the worst-case disturbance without knowledge of the system drift dynamics. However, it requires three NNs of the same structure: one for the critic and the others for the actors. The number of neurons in the hidden layers should be at least $(n+1)n/2$, where $n$ is the number of state variables. In practical applications, e.g. robotics, the number of state variables measured from sensors for feedback may be relatively large. With three NNs, the number of NN weights and activation functions, whose elements are combinations of the states, increases significantly. If applied directly to a non-linear system, the algorithm may lead to high computational complexity and resource consumption. In contrast, the method using a single online approximator (SOLA) in Dierks and Jagannathan (2010) to solve the HJI equation reduces the number of NNs but, unfortunately, it is a type of model-based RL.

¹ Industrial University of Ho Chi Minh City, Ho Chi Minh City, Vietnam

² Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam

Corresponding author: Nguyen Tan Luy, Division of Automation Electronics, 305A BaHom District, Kienthanh Building, Ho Chi Minh City, Ho Chi Minh 70000, Vietnam.

Email: luynguyentan@yahoo.com

Motivated by the aforementioned problems, this paper makes three main contributions. The first is the derivation of tracking dynamics from a non-linear strict-feedback model of the WMR, the purpose of which is to design an integrated kinematic and dynamic RL-based intelligent controller, i.e. an integrated kinematic and dynamic robust direct adaptive tracking controller with optimality that needs no explicit knowledge of the system's drift dynamics. The second is that the actor critic structure in the RL scheme uses only one NN, for the critic. The third is the tuning law for the critic NN, by which the solution of the tracking HJI equation is learned, and the optimal values of the tracking performance index function, the robust direct adaptive control law and the worst-case disturbance law are approximated without accessing the system's drift dynamics. By Lyapunov techniques, the closed-loop system state and the critic NN error are proved to be uniformly ultimately bounded, and the system parameters are shown to converge asymptotically to their optimal target values.

The paper is organized as follows. The next section provides the theoretical background of the WMR to establish the non-linear WMR system in strict-feedback form, from which the new tracking dynamics are derived. Then we design the integrated kinematic and dynamic robust direct adaptive tracking control scheme with optimality, along with the tuning law for the critic NN, and give proofs of stability and convergence. The simulation results on the WMR verify the effectiveness of the proposed algorithm, and conclusions are drawn.

Strict-feedback kinematic and dynamic model

A WMR with differentially driven wheels mounted on a driving axle can move and rotate on the horizontal plane thanks to two independent actuators. Torque from the actuators is transmitted to the left and right wheels to drive the robot. The mass of the WMR, including the mass of the platform without the wheels and the mass of the wheels, is concentrated at a centre point. The distance between the driven wheels is $b_1$. The radius of each wheel is $r_1$. The distance from the centre point to the driving axle is $l$. Without loss of generality, it can be assumed that $l = 0$. The WMR is considered a mechanical system with $n$ generalized configuration variables $q$ subject to $m$ constraints ($m < n$), represented by the following equation (Khoshnam et al., 2011):

$$H_{k,j}(q,\dot q) = \sum_{i=1}^{n} h_{k,ji}(q,\dot q)\,\dot q_i = 0, \quad j = 1,\dots,m \qquad (1)$$

where the numbers of holonomic and non-holonomic constraints are $k$ and $m-k$, respectively. The constraints are independent of time and can be written as $A_k(q)\dot q = 0$, where $A_k \in \mathbb{R}^{m\times n}$ is a full-rank matrix. Assume that $S(q) \in \mathbb{R}^{n\times(n-m)}$ is also a full-rank matrix formed from a set of smooth and linearly independent vector fields in the null space of $A_k$, such that $A_k(q)S(q) = 0$. Let $\vartheta(t) \in \mathbb{R}^{n-m}$ be the velocity vector, which can be seen as the pseudo-control vector and is important for forming a strict-feedback non-linear system afterwards. The kinematic equation of WMR motion can be written as

$$\dot q = S(q)\,\vartheta \qquad (2)$$
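As an illustration of the kinematic model, the following minimal sketch evaluates $\dot q = S(q)\vartheta$ for a differential-drive robot with $q = [x, y, \theta]^T$ and velocity vector $[v, \omega]^T$, and checks numerically that $A_k(q)S(q) = 0$. The no-side-slip constraint row used for `A` is the standard one for this robot type and is an assumption here, as are all function names; the paper does not list $A_k$ explicitly.

```python
import numpy as np

def S(q):
    """Kinematic matrix S(q) for q = [x, y, theta], assuming l = 0."""
    theta = q[2]
    return np.array([[np.cos(theta), 0.0],
                     [np.sin(theta), 0.0],
                     [0.0, 1.0]])

def A(q):
    """Assumed no-side-slip constraint: -x_dot*sin(theta) + y_dot*cos(theta) = 0."""
    theta = q[2]
    return np.array([[-np.sin(theta), np.cos(theta), 0.0]])

def kinematics(q, vel):
    """Kinematic model q_dot = S(q) @ vel, with vel = [v, omega]."""
    return S(q) @ vel

q = np.array([0.5, 0.5, np.pi / 6])   # illustrative posture
vel = np.array([1.0, 0.2])            # illustrative linear and angular velocity
q_dot = kinematics(q, vel)
constraint_product = A(q) @ S(q)      # should vanish for every q
```

The columns of `S(q)` span the null space of `A(q)`, so any velocity vector produces a motion that respects the constraint.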

To derive a dynamic equation of the WMR, the Lagrange formalism is used as follows:

$$\frac{d}{dt}\left(\frac{\partial L}{\partial \dot q}\right) - \frac{\partial L}{\partial q} = F_T \qquad (3)$$

where $F_T$ is the vector of the generalized forces. The WMR moves on a plane, so the Lagrangian $L$ includes only kinetic energy:

$$L = \frac{1}{2}\sum_{i=1}^{n_i}\left(v_i^T m_i v_i + \omega_i^T I_i \omega_i\right) \qquad (4)$$

where $v_i$, $\omega_i$, $m_i$ and $I_i$ denote the linear velocity, the rotational velocity, the mass and the moment of inertia, respectively. As a result, the dynamic model of the WMR is expressed as

$$M(q)\ddot q + C(q,\dot q)\dot q + B(q)F(\dot q) + B(q)\tau_d = B(q)\tau - A_k^T(q)\lambda \qquad (5)$$

where $M(q)\in\mathbb{R}^{n\times n}$ is the symmetric positive definite inertia matrix, $C(q,\dot q)\in\mathbb{R}^{n\times n}$ is the centripetal and Coriolis matrix, $F(\dot q)\in\mathbb{R}^{n\times 1}$ is the surface friction and gravitational vector, $\tau_d\in\mathbb{R}^{(n-m)\times 1}$ denotes the bounded unknown disturbances including unstructured unmodelled dynamics, $B(q)\in\mathbb{R}^{n\times(n-m)}$ is the input transformation matrix, $\tau\in\mathbb{R}^{(n-m)\times 1}$ is the input torque vector and $\lambda\in\mathbb{R}^{m\times 1}$ is the vector of constraint forces. Taking the time derivative of the kinematic model (2), one obtains

$$\ddot q = \dot S(q)\vartheta + S(q)\dot\vartheta \qquad (6)$$

Substituting (2) and (6) into (5), multiplying both sides of the result by $S^T(q)$ and noting that $A_k(q)S(q) = 0$, one obtains

$$\bar M(q)\dot\vartheta(t) + \bar C(q,\dot q)\vartheta(t) + \bar F(\dot q) + \bar\tau_d = \bar B(q)\tau \qquad (7)$$

where $\bar M(q) = S^T M S$, $\bar C(q,\dot q) = S^T M \dot S + S^T C S$, $\bar B(q) = S^T B$, $\bar F(\dot q) = S^T M \dot S\vartheta + \bar B F$ and $\bar\tau_d = \bar B\tau_d$.

Definition 1. Let $f_q(q) = 0_{n\times 1}$, $g_q(q) = S(q)\in\mathbb{R}^{n\times(n-m)}$, $f_\vartheta(q,\vartheta) = -\bar M^{-1}(q)\left(\bar C(\dot q,q)\vartheta + \bar F(\dot q)\right)\in\mathbb{R}^{(n-m)\times 1}$, $g_\vartheta(q,\vartheta) = \bar M^{-1}(q)\bar B(q)\in\mathbb{R}^{(n-m)\times(n-m)}$ and $k_\vartheta(q,\vartheta) = -\bar M^{-1}(q)\in\mathbb{R}^{(n-m)\times(n-m)}$. The state-space equation of the WMR, a non-linear system in strict-feedback form, is obtained from the kinematic and dynamic Equations (2) and (7):

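$$\begin{cases}\dot q = f_q(q) + g_q(q)\vartheta \\ \dot\vartheta = f_\vartheta(q,\vartheta) + g_\vartheta(q,\vartheta)\tau + k_\vartheta(q,\vartheta)\bar\tau_d\end{cases} \qquad (8)$$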
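The strict-feedback structure of (8) can be sketched numerically for the differential-drive case. The snippet below is only an illustration: it takes the reduced inertia and input matrices and the numerical parameters from the simulation section of the paper, and it assumes $l = 0$, no friction and no Coriolis term, so that the drift term $f_\vartheta$ vanishes. The function name and argument names are hypothetical.

```python
import numpy as np

# Parameters quoted in the simulation section (m, I_1, r_1, b_1).
M_BAR = np.diag([10.0, 5.0])                   # reduced inertia matrix diag(m, I_1)
R1, B1 = 0.05, 0.5
B_BAR = (1.0 / R1) * np.array([[1.0, 1.0],
                               [B1, -B1]])     # reduced input matrix

def wmr_state_derivative(q, vel, tau, tau_d_bar=np.zeros(2)):
    """Right-hand side of the strict-feedback model (8).

    Assumes l = 0 and zero friction, so the drift term f_vartheta is zero.
    q = [x, y, theta], vel = [v, omega], tau = wheel torques.
    """
    theta = q[2]
    S = np.array([[np.cos(theta), 0.0],
                  [np.sin(theta), 0.0],
                  [0.0, 1.0]])
    q_dot = S @ vel                                             # kinematic layer
    vel_dot = np.linalg.solve(M_BAR, B_BAR @ tau - tau_d_bar)   # dynamic layer
    return q_dot, vel_dot

q_dot, vel_dot = wmr_state_derivative(np.zeros(3),
                                      np.array([1.0, 0.0]),
                                      np.array([0.5, 0.5]))
```

The velocity vector enters the first equation as a pseudo-control and the torque enters only the second equation, which is what makes the model strict-feedback.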

The system (8) is assumed to be controllable and drift free, with $(q,\vartheta) = 0$ a unique equilibrium point on a compact set $X\subseteq\mathbb{R}^{2n-m}$. The following properties are important.

Property 1. $\bar M(q)$ is a bounded, symmetric and positive definite matrix such that $m_{\min}\le\|\bar M(q)\|\le m_{\max}$, where $m_{\min}$ and $m_{\max}$ are positive scalar constants.

Property 2. $\bar C(q)$ is bounded such that $c_{\min}\le\|\bar C(q)\|\le c_{\max}$, where $c_{\min}$ and $c_{\max}$ are positive scalar constants.

Property 3. The disturbance $\bar\tau_d$ is bounded such that $\|\bar\tau_d\|\le\tau_{d\max}$, where $\tau_{d\max}$ is a positive scalar constant.

Property 4. $f_\vartheta(q,\vartheta)$ is the system uncertainty dynamics term and $\|f_\vartheta(q,\vartheta)\|\le c_{\max}m_{\min}^{-1}\|\vartheta\|$.

Property 5. $g_q(q)$ is bounded such that $g_{\min}\le\|g_q(q)\|\le g_{\max}$, where $g_{\min}$ and $g_{\max}$ are positive scalar constants.

Property 6. $g_\vartheta(q,\vartheta)$ is bounded such that $m_{\max}^{-1}\|\bar B\|\le\|g_\vartheta(q,\vartheta)\|\le m_{\min}^{-1}\|\bar B\|$, where $\bar B(q)$, a constant non-singular matrix, depends on the geometric parameters of the WMR, i.e. the wheel radius $r_1$ and the robot frame width $b_1$ (Khoshnam et al., 2011), and according to Property 1, $g_\vartheta(q,\vartheta)\ne 0$.

Property 7. $k_\vartheta(q,\vartheta)$ is bounded such that $m_{\max}^{-1}\le\|k_\vartheta(q,\vartheta)\|\le m_{\min}^{-1}$ and, according to Property 1, $k_\vartheta(q,\vartheta)\ne 0$.

Property 8. $f_\vartheta(q,\vartheta)$, $g_q(q)$, $g_\vartheta(q,\vartheta)$ and $k_\vartheta(q,\vartheta)$ are non-linear smooth functions.

Definition 2. If a reference robot generates a bounded smooth trajectory vector $q_d$ that satisfies the constraint $\dot q_d = S(q_d)\vartheta_{rd}$, where $\vartheta_{rd}$ is the smooth velocity vector, the main objective of the robust adaptive tracking control problem for the WMR is to design integrated kinematic and dynamic feedback control laws for the dynamic system (8), which contains uncertainty terms and disturbance, such that $e_q\to 0$ as $t\to\infty$, with $e_q = q - q_d$. Furthermore, a defined tracking cost function related to (8) must be optimized.

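To obtain tracking dynamics for designing the integrated kinematic and dynamic feedback control law, model (8) is transformed in several steps. The first equation in (8) is written as

$$\dot q - \dot q_d = \dot e_q = -\dot q_d + g_q(q)\vartheta_d + g_q(q)(\vartheta - \vartheta_d) = g_q(q)\vartheta_d^{*} + g_q(q)e_\vartheta \qquad (9)$$

where $e_\vartheta = \vartheta - \vartheta_d\in\mathbb{R}^{n-m}$ and $\vartheta_d\in\mathbb{R}^{n-m}$ is the virtual control input such that $\vartheta_d = \vartheta_d^{*} + \vartheta_{da}$. Here $\vartheta_d^{*}\in\mathbb{R}^{n-m}$ is an optimal tracking control input vector designed later, and $\vartheta_{da}$, the feed-forward virtual control input, is the solution of the equation

$$0 = -\dot q_d + g_q(q)\vartheta_{da} \qquad (10)$$

Similarly, the last equation in (8) is written as

$$\dot\vartheta - \dot\vartheta_d = \dot e_\vartheta = -\dot\vartheta_d + f_\vartheta(q,\vartheta) + g_\vartheta(q,\vartheta)\tau + k_\vartheta(q,\vartheta)\bar\tau_d = f_\vartheta(q,\vartheta) + g_\vartheta(q,\vartheta)\tau^{*} - g_q^T(q)e_q + k_\vartheta(q,\vartheta)\bar\tau_d \qquad (11)$$

where $\tau^{*}$ is the tracking control input designed later such that $\tau = \tau^{*} + \tau_a$, and $\tau_a$ is a solution of the equation

$$0 = -\dot\vartheta_d + g_\vartheta(q,\vartheta)\tau_a + g_q^T(q)e_q \qquad (12)$$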

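Definition 3. Let $x_d = [q_d^T, \vartheta_d^T]^T\in\mathbb{R}^{(2n-m)\times 1}$, $x = [q^T, \vartheta^T]^T\in\mathbb{R}^{(2n-m)\times 1}$, $e = [e_q^T, e_\vartheta^T]^T\in\mathbb{R}^{2n-m}$, $f(x) = [0_{n\times 1}^T, f_\vartheta^T(q,\vartheta)]^T\in\mathbb{R}^{2n-m}$, $u = \bar u - u_a$, $\bar u = [\vartheta_d^T, \tau^T]^T\in\mathbb{R}^{2(n-m)\times 1}$, $u_a = [\vartheta_{da}^T, \tau_a^T]^T\in\mathbb{R}^{2(n-m)\times 1}$, $g(x) = \mathrm{diag}[g_q(q), g_\vartheta(x)]\in\mathbb{R}^{(2n-m)\times 2(n-m)}$, $k(x) = \mathrm{diag}[k_q(q), k_\vartheta(q,\vartheta)]\in\mathbb{R}^{(2n-m)\times 2(n-m)}$, $k_q(q) = 0_{n\times(n-m)}$ and $d = [0_{1\times(n-m)}, \bar\tau_d^T]^T\in\mathbb{R}^{2(n-m)\times 1}$.

Lemma 1. Consider the tracking dynamics of the WMR as follows:

$$\dot e = f(x) + g(x)u + k(x)d \qquad (13)$$

If the control law $u$ is designed for (13), it can also serve as the control law for (8); that is, the control law $u$ for (13) and (8) is equivalent.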

Proof. For (8), choose the Lyapunov function candidate $J = e^Te/2$. Taking its derivative along (9) and (11), one obtains

$$\begin{aligned}\dot J &= e_q^Tf_q + e_q^Tg_q\vartheta_d^{*} + e_q^Tg_qe_\vartheta + e_\vartheta^Tf_\vartheta + e_\vartheta^Tg_\vartheta\tau^{*} - e_\vartheta^Tg_q^Te_q + e_\vartheta^Tk_\vartheta\bar\tau_d \\ &= e_q^T(f_q + g_q\vartheta_d^{*}) + e_\vartheta^T(f_\vartheta + g_\vartheta\tau^{*} + k_\vartheta\bar\tau_d) \\ &= e^T\left(f(x) + g(x)u + k(x)d\right)\end{aligned} \qquad (14)$$

Comparing (14) and (13), it can be seen that the control law $u$ for (13) and (8) is equivalent. This completes the proof.

Remark 1. If the control law $u$ exists, it is an integrated kinematic and dynamic control law, as opposed to back-stepping control laws, in which the kinematic and dynamic control inputs are separated.

Remark 2. $f(x)$ represents the system's drift tracking dynamics and $d$ is the bounded unknown disturbance according to Property 3.

Fact 1. $\|\vartheta\|\le\|\vartheta - \vartheta_d\| = \|e_\vartheta\|$; thus, by Property 4, $f(x)$ is constrained by $\|f(x)\|\le m_{\min}^{-1}c_{\max}\|e\|$.

RL-based intelligent tracking control algorithm

According to the defined objective, applying and developing the policy iteration (PI) algorithm of RL for system (13) is an appropriate choice. RL can be used to learn online HJB solutions for optimal control problems (Vamvoudakis and Lewis, 2010) and HJI solutions for $H_\infty$ optimal problems (Vamvoudakis and Lewis, 2011; Vamvoudakis et al., 2011). Define a value function based on the $H_\infty$ tracking performance index function subject to (13) (Chen et al., 1998; Chen et al., 2009; Luy et al., 2010):

$$V = \int_t^{\infty} r\left(e(s), u(s), d(s)\right)ds \qquad (15)$$

where $r(e,u,d) = Q(e) + u^TRu - \rho^2d^Td$; $Q(e)$ is positive definite, i.e. $Q(e) > 0$ for all $e\ne 0$ and $Q(e) = 0$ for $e = 0$; $u$ is the admissible control input that minimizes $V$ while $d$ tries to maximize $V$ (Dierks and Jagannathan, 2010; Vamvoudakis and Lewis, 2011); $R$ is a symmetric positive definite matrix; and $\rho\ge\rho^{*}$ is the prescribed disturbance attenuation level, where $\rho^{*} > 0$ is the minimum gain of $\rho$ for which the stability of the closed-loop tracking system (13) is guaranteed (Van Der Shaft, 1992). Define the Hamiltonian of (15) associated with $u$ and $d$ as

$$H(e,u,d,V_e) = r(e,u,d) + V_e^T\left(f(x) + g(x)u + k(x)d\right) \qquad (16)$$

where $V_e = \partial V(e)/\partial e$. There exists a minimum non-negative local smooth solution of (16) (Dierks and Jagannathan, 2010; Vamvoudakis et al., 2011). If $V_e^{*}$ is that solution and (13) is locally detectable, then the Nash equilibrium solutions in terms of $V_e^{*}$, i.e. $u^{*}$ and $d^{*}$, can be found from the stationary conditions of (16):

$$u^{*}(e) = -\frac{1}{2}R^{-1}g^T(x)V_e^{*} \qquad (17)$$

$$d^{*}(e) = \frac{1}{2\rho^2}k^T(x)V_e^{*} \qquad (18)$$

where $V_e^{*} = \partial V^{*}(e)/\partial e$. The tracking HJI equation is obtained by substituting (17) and (18) into (16):

$$Q(e) + V_e^{*T}f(x) - \frac{1}{4}V_e^{*T}g(x)R^{-1}g^T(x)V_e^{*} + \frac{1}{4\rho^2}V_e^{*T}k(x)k^T(x)V_e^{*} = 0, \quad V^{*}(0) = 0 \qquad (19)$$
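To make the stationary conditions and the HJI equation concrete, the sketch below works through a scalar linear-quadratic case that is not taken from the paper: $\dot e = ae + bu + kd$ with $Q(e) = qe^2$, $R = r$ and candidate value function $V(e) = pe^2$. Substituting $V_e = 2pe$ into (19) reduces the HJI equation to the quadratic $(b^2/r - k^2/\rho^2)p^2 - 2ap - q = 0$. All names and numbers are illustrative assumptions.

```python
import math

def scalar_hji_solution(a, b, k, q, r, rho):
    """Positive root p of (b^2/r - k^2/rho^2) p^2 - 2 a p - q = 0."""
    c = b**2 / r - k**2 / rho**2
    if c <= 0:
        raise ValueError("attenuation level rho is too small for this sketch")
    return (a + math.sqrt(a**2 + c * q)) / c

def hji_residual(p, e, a, b, k, q, r, rho):
    """Left-hand side of the HJI equation (19) for V(e) = p e^2."""
    V_e = 2.0 * p * e
    return (q * e**2 + V_e * a * e
            - 0.25 * V_e * (b**2 / r) * V_e
            + V_e * k**2 * V_e / (4.0 * rho**2))

def saddle_point_policies(p, e, b, k, r, rho):
    """Control (17) and worst-case disturbance (18) for V(e) = p e^2."""
    V_e = 2.0 * p * e
    u_star = -0.5 * (1.0 / r) * b * V_e
    d_star = k * V_e / (2.0 * rho**2)
    return u_star, d_star

params = dict(a=-1.0, b=1.0, k=1.0, q=1.0, r=1.0, rho=2.0)
p = scalar_hji_solution(**params)
residual = hji_residual(p, 0.7, **params)
u_star, d_star = saddle_point_policies(p, 0.7, b=1.0, k=1.0, r=1.0, rho=2.0)
```

In this scalar case the control pushes the error towards zero while the worst-case disturbance pushes it away, which is the minimizing and maximizing behaviour described for $u$ and $d$.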

The solution of the HJI equation (19) can be learned without explicit knowledge of the system's drift dynamics by an integral RL-based PI algorithm in which three NNs of the same structure are required for the actor critic (Vamvoudakis et al., 2011). Using three NNs may lead to high computational complexity and resource consumption when applied to multivariable systems such as the WMR system defined earlier. Therefore, in this paper, a new actor critic scheme is proposed for the tracking problem that uses only one NN, with the purpose of reducing the cost of computation and storage resources. The critic NN that approximates the optimal value function (15) is defined as

$$V(e) = W^T\Phi(e) + \varepsilon(e) \qquad (20)$$

where $\Phi(e):\mathbb{R}^n\to\mathbb{R}^N$ is the activation function vector, $N$ is the number of neurons in the hidden layer, $\varepsilon(e)$ is the NN approximation error and $W\in\mathbb{R}^N$ is the ideal NN weight vector. $\Phi(e)$ can be selected such that $\varepsilon(e)\to 0$ and $\varepsilon_e(e) = \partial\varepsilon(e)/\partial e\to 0$ as $N\to\infty$, and for fixed $N$, $\|\varepsilon(e)\| < \varepsilon_{\max}$ and $\|\varepsilon_e(e)\| < \varepsilon_{e\max}$, where $\varepsilon_{\max}$ and $\varepsilon_{e\max}$ are positive constants (Finlayson, 1990). Substituting (19) and (20) into (16) gives the NN-based HJI equation

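$$Q(e) + W^T\Phi_ef(x) - \frac{1}{4}W^T\Phi_eG\Phi_e^TW + \frac{1}{4}W^T\Phi_eK\Phi_e^TW + \varepsilon_{HJI} = 0 \qquad (21)$$

where $G = gR^{-1}g^T$, $K = kk^T/\rho^2$, $\Phi_e(e) = \partial\Phi(e)/\partial e$ and $\varepsilon_{HJI}$ is the residual error formed by the NN approximation error:

$$\begin{aligned}\varepsilon_{HJI} &= \varepsilon_e^Tf - \frac{1}{2}W^T\Phi_e(G-K)\varepsilon_e - \frac{1}{4}\varepsilon_e^T(G-K)\varepsilon_e \\ &= \varepsilon_e^T\left(f(x) + gu^{*} + kd^{*}\right) + \frac{1}{4}\varepsilon_e^T(G-K)\varepsilon_e\end{aligned} \qquad (22)$$

When $N\to\infty$, $\varepsilon_{HJI}$ converges uniformly to zero. For fixed $N$, $\varepsilon_{HJI}$ is bounded on a compact set (Vamvoudakis et al., 2011).

Fact 2. According to Properties 5 and 6, $G$ is bounded such that $0 < G_{\min}\le\|G\|\le G_{\max}$, with $G_{\min} = g_{\min}^2/\sigma_{\max}(R)$ and $G_{\max} = g_{\max}^2/\sigma_{\min}(R)$, where $\sigma_{\max}(R)$ and $\sigma_{\min}(R)$ are the largest and smallest eigenvalues of $R$, respectively.

Fact 3. According to Property 7, $K$ is bounded such that $0 < K_{\min}\le\|K\|\le K_{\max}$, with $K_{\min} = k_{\min}^2/\rho^2$ and $K_{\max} = k_{\max}^2/\rho^2$.

Assumption 1. The closed-loop tracking dynamics of the WMR are bounded such that $\|f(x) + gu^{*} + kd^{*}\|\le g_{\max}$ for a positive constant $g_{\max}$.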

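The ideal weight vector $W$ in (20) is unknown; thus $V(e)$ is estimated by

$$\hat V(e) = \hat W^T\Phi(e) \qquad (23)$$

Then, the estimated control and disturbance laws become

$$\hat u(e) = -\frac{1}{2}R^{-1}g^T(x)\Phi_e^T\hat W \qquad (24)$$

$$\hat d(e) = \frac{1}{2\rho^2}k^T(x)\Phi_e^T\hat W \qquad (25)$$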
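A minimal sketch of how the estimated laws (24) and (25) can be evaluated from a single weight vector is given below. It assumes that the Jacobian $\Phi_e$ is stored as an $N\times n$ array, so that $\Phi_e^T\hat W$ is the estimated value gradient; the function name, the three-element quadratic basis and the numerical values are illustrative and not taken from the paper.

```python
import numpy as np

def control_and_disturbance(W_hat, Phi_e, g, k, R, rho):
    """Evaluate the estimated control (24) and disturbance (25) laws.

    W_hat : (N,) critic weights
    Phi_e : (N, n) Jacobian of the activation vector with respect to e
    g, k  : (n, m_u) and (n, m_d) input and disturbance matrices
    R     : (m_u, m_u) symmetric positive definite weighting matrix
    """
    V_e_hat = Phi_e.T @ W_hat                          # estimated gradient of V
    u_hat = -0.5 * np.linalg.solve(R, g.T @ V_e_hat)   # -1/2 R^{-1} g^T Phi_e^T W
    d_hat = (k.T @ V_e_hat) / (2.0 * rho**2)           # 1/(2 rho^2) k^T Phi_e^T W
    return u_hat, d_hat

# Illustrative two-state example with basis [e1^2, e1*e2, e2^2].
e = np.array([1.0, 2.0])
Phi_e = np.array([[2 * e[0], 0.0],
                  [e[1], e[0]],
                  [0.0, 2 * e[1]]])
W_hat = np.array([0.5, 0.0, 1.0])
g = np.array([[0.0], [1.0]])
k = np.array([[0.0], [1.0]])
u_hat, d_hat = control_and_disturbance(W_hat, Phi_e, g, k, np.array([[1.0]]), 2.0)
```

Both laws share the same weight vector, which is why only one NN needs to be tuned.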

The approximate Hamiltonian is obtained by substituting (23), (24) and (25) into (16):

$$\hat H(e,\hat W) = Q(e) + \hat W^T\Phi_ef(x) - \frac{1}{4}\hat W^T\Phi_eG\Phi_e^T\hat W + \frac{1}{4}\hat W^T\Phi_eK\Phi_e^T\hat W \qquad (26)$$

Comparing Equations (21) and (26), it is straightforward to see that $\hat W$ should be tuned to minimize an error function related to $\hat H(e,\hat W)$. To design a tuning law for $\hat W$ that does not depend on $f(x)$, the error function is chosen as $E = \frac{1}{2}e_H^Te_H$, where $e_H = \int_t^{t+T}\hat H(e,\hat W)\,ds$ and $T > 0$ is a chosen sampling time. Then, the tuning law becomes $\dot{\hat W} = -\alpha_1\,\partial E/\partial\hat W$. In addition, because of the approximation error during online learning, it is desired to design the tuning law of $\hat W$ such that it not only minimizes $E$ but also guarantees the stabilization of the system, concurrently. If more than one NN is used, the tuning law of the critic NN is responsible for minimizing $E$, while the tuning laws of the actor NNs guarantee robust stability of the overall system. In our case, only one NN is used, and thus both objectives must be integrated into one, i.e.

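$$\dot{\hat W} = \begin{cases}\dot{\hat W}_1, & e_{t+T}^Te_{t+T}\le e_t^Te_t \\ W_{RB} + \dot{\hat W}_1, & \text{otherwise}\end{cases} \qquad (27)$$

where $e_t = e(t)$, $e_{t+T} = e(t+T)$, and

$$\dot{\hat W}_1 = -\alpha_1\frac{\sigma}{(\sigma^T\sigma + 1)^2}\left(\int_t^{t+T}\left(Q(e) + \frac{1}{4}\hat W^T\Phi_eG\Phi_e^T\hat W - \frac{1}{4}\hat W^T\Phi_eK\Phi_e^T\hat W\right)ds + \Delta\Phi^T(e(t))\hat W\right) \qquad (28)$$

$$W_{RB} = \frac{\alpha_2}{2}\Phi_e(G-K)e \qquad (29)$$

where

$$\sigma = \int_t^{t+T}\Phi_e\left(f(x) + g\hat u + k\hat d\right)ds = \int_t^{t+T}\Phi_e\dot e\,ds = \int_t^{t+T}d\left(\Phi(e(s))\right) = \Phi(e_{t+T}) - \Phi(e_t) = \Delta\Phi(e(t)).$$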
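The normalized-gradient part of the tuning law can be illustrated with the simplified sketch below. It is not the full law (27): the robustifying term $W_{RB}$ is omitted, the integral of the running cost over $[t, t+T]$ is treated as an already measured number (in (28) it also depends on $\hat W$), and the continuous-time law is discretized with one explicit Euler step of size `dt`. All names and values are hypothetical.

```python
import numpy as np

def integral_bellman_error(W_hat, cost_integral, delta_phi):
    """e_H = (integrated running cost over [t, t+T]) + delta_phi^T W_hat."""
    return cost_integral + delta_phi @ W_hat

def critic_step(W_hat, cost_integral, delta_phi, alpha1, dt):
    """One Euler step of the normalized gradient term, with sigma = delta_phi."""
    sigma = delta_phi
    e_H = integral_bellman_error(W_hat, cost_integral, delta_phi)
    W_dot = -alpha1 * sigma / (sigma @ sigma + 1.0) ** 2 * e_H
    return W_hat + dt * W_dot

# Illustrative data: change of the activation vector over one sampling interval.
delta_phi = np.array([-0.3, 0.1, -0.2])
cost_integral = 0.5
W0 = np.zeros(3)
W1 = critic_step(W0, cost_integral, delta_phi, alpha1=50.0, dt=0.01)
error_before = integral_bellman_error(W0, cost_integral, delta_phi)
error_after = integral_bellman_error(W1, cost_integral, delta_phi)
```

Because $\sigma = \Delta\Phi$ is computed from measured errors at $t$ and $t+T$, the update needs no evaluation of the drift term $f(x)$.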

It will be shown in the proof of Theorem 1 that $\dot{\hat W}_1$ in (27), together with the added term $W_{RB}$, guarantees that the closed-loop system is uniformly ultimately bounded (UUB; Lewis et al., 1999) when the behaviour of the overall system becomes unstable.

The proposed actor critic structure for online learning and feedback control is shown in Figure 1. The adaptive tuning law (27) for the single NN updates the NN weights such that the error function of the approximate Hamiltonian (26) is minimized. It does not involve the system's drift dynamics, so the intelligent tracking control law defined previously can be obtained.

To guarantee the convergence of $\hat W$, the control inputs and disturbance must be fully explored by adding probing noise to $\hat u(e)$ and $\hat d(e)$. That means the persistence of excitation (PE) condition must be satisfied on the interval $[t, t+T_P]$ with $T_P > 0$, for all $t$ (Vamvoudakis et al., 2011):

$$\beta_1I\le\int_t^{t+T_P}\bar\sigma(s)\bar\sigma^T(s)\,ds\le\beta_2I \qquad (30)$$

where $\beta_1$ and $\beta_2$ are positive constants, $\bar\sigma = \sigma/(\sigma^T\sigma + 1)$ and $I$ is the identity matrix of appropriate dimension.

Theorem 1. Let the tracking dynamics of the WMR be given by (13) with the tracking HJI equation (19), let the critic NN be given by (20), the tuning law for the critic be defined by (27), and the intelligent tracking control and disturbance laws that approximate the $H_\infty$ optimal tracking cost function (15) be defined by (24) and (25), and let $\bar\sigma$ satisfy the PE condition (30). Then, the closed-loop system state $e$ and the NN weight error $\tilde W$ are UUB for a limited number of hidden-layer units. Furthermore, the approximation errors of the control input and the worst-case disturbance are bounded such that $\|u^{*} - \hat u\| < \varepsilon_u$ and $\|d^{*} - \hat d\| < \varepsilon_d$ for small positive constants $\varepsilon_u$ and $\varepsilon_d$.

Proof. See Appendix A.

The proposed algorithm is represented by the block diagram in Figure 2. $T_{stop}$ is the time at which the algorithm stops, $p$ is the probing noise and the other parameters are as defined above.

Simulation results

To verify the proposed algorithm, two numerical simulations are presented. In the first, a non-linear system is learned and controlled by the proposed algorithm using one NN, in comparison with an algorithm using three NNs (Vamvoudakis et al., 2011). In the second, the proposed algorithm is applied to the WMR.

Figure 1. The proposed actor critic structure.

Non-linear system

Consider the non-linear system with disturbance inputs, with a quadratic cost defined as in Vamvoudakis et al. (2011):

$$\dot x = f(x) + g(x)u + k(x)d \qquad (31)$$

where

$$f(x) = \begin{bmatrix}-x_1 + x_2 \\ -x_1^3 - x_2^3 + 0.25x_2\left(\cos(2x_1) + 2\right)^2 - \dfrac{0.25}{\rho^2}x_2\left(\sin(4x_1) + 2\right)^2\end{bmatrix},$$

$g(x) = [0, \cos(2x_1) + 2]^T$ and $k(x) = [0, \sin(4x_1) + 2]^T$. We simulate the non-linear system in turn with the proposed algorithm using one NN and with the algorithm in Vamvoudakis et al. (2011) using three NNs. To be comparable, the optimal tracking problem in this paper is transformed to the optimal control problem of Vamvoudakis et al. (2011) by defining the tracking error vector as $e = x - x_d$ with $x_d = 0$. In this case, the tracking dynamics Equation (13) takes the form of Equation (31).

In both algorithms, one selects $Q = \begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix}$, $R = 1$, $\rho = 8$, $\alpha_1 = 50$, $\alpha_2 = 0.01$, $T = 0.05$ and $T_{stop} = 80$ s. The optimal value function is $V^{*}(x) = \frac{1}{4}x_1^4 + \frac{1}{2}x_2^2$, so in theory the optimal inputs are $u^{*}(x) = -\frac{1}{2}\left(\cos(2x_1) + 2\right)x_2$ and $d^{*}(x) = \frac{1}{2\rho^2}\left(\sin(4x_1) + 2\right)x_2$. The NN activation function vectors are defined as $\Phi(x) = [x_1^2, x_2^2, x_1^4, x_2^4]^T$ and the weight vectors of the critic NNs are defined as $\hat W^1 = [\hat W_1^1, \hat W_2^1, \hat W_3^1, \hat W_4^1]^T$ for the algorithm using one NN and $\hat W^3 = [\hat W_1^3, \hat W_2^3, \hat W_3^3, \hat W_4^3]^T$ for the one using three NNs. All initial weight values are zero. The other parameters for the three NNs can be found in Vamvoudakis et al. (2011).
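As a quick consistency check on this example, the sketch below evaluates the laws (24) and (25) with the activation vector $[x_1^2, x_2^2, x_1^4, x_2^4]^T$ and the weights $[0, 0.5, 0.25, 0]^T$ that represent the stated optimal value function exactly, and compares the result with the closed-form optimal inputs quoted above. The function names and the test point are illustrative assumptions.

```python
import numpy as np

RHO = 8.0
W_STAR = np.array([0.0, 0.5, 0.25, 0.0])   # weights for V = x1^4/4 + x2^2/2

def phi_jacobian(x):
    """Jacobian (4 x 2) of Phi(x) = [x1^2, x2^2, x1^4, x2^4]."""
    x1, x2 = x
    return np.array([[2 * x1, 0.0],
                     [0.0, 2 * x2],
                     [4 * x1**3, 0.0],
                     [0.0, 4 * x2**3]])

def g(x):
    return np.array([0.0, np.cos(2 * x[0]) + 2.0])

def k(x):
    return np.array([0.0, np.sin(4 * x[0]) + 2.0])

def policies(x, W):
    """Control (24) and disturbance (25) with R = 1."""
    V_x = phi_jacobian(x).T @ W
    u = -0.5 * g(x) @ V_x
    d = k(x) @ V_x / (2.0 * RHO**2)
    return u, d

x = np.array([0.3, -0.7])
u, d = policies(x, W_STAR)
u_closed_form = -0.5 * (np.cos(2 * x[0]) + 2.0) * x[1]
d_closed_form = (np.sin(4 * x[0]) + 2.0) * x[1] / (2.0 * RHO**2)
```

The same `policies` function can be evaluated with learned weights to see how close the learned inputs are to these closed-form expressions.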

The convergence of the critic parameters of both algorithms is shown in Figure 3. In the algorithm using one NN, all parameters converge at about 20 s to the values $\hat W^1 = [0.006, 0.5, 0.2483, 0]^T$, while with three NNs they converge more slowly, at about 50 s, to the values $\hat W^3 = [0.005, 0.5, 0.2437, 0]^T$. In addition, the parameters of the actor and disturbance NNs in the algorithm using three NNs also converge to approximately optimal values (see Vamvoudakis et al., 2011, for more detail). Thus, using (24) and (25), both algorithms give similar optimal control inputs $u^{*}(x)$ and optimal disturbance inputs $d^{*}(x)$. However, with a single NN, the proposed algorithm reduces complexity and resource use and converges faster than the algorithm using three NNs.

Wheeled mobile robot

Consider the WMR defined above. With the notation introduced before, the state vectors and parameters of the WMR are $q = [x, y, \theta]^T$, $\vartheta = [v, \omega]^T$, $r_1 = 0.05$ m, $b_1 = 0.5$ m, $l = 0$, $m = 10$ kg and $I_1 = 5$ kg·m², where $m$ and $I_1$ denote the mass and the moment of inertia of the platform, motors and wheels, respectively. Note that with the designed robust adaptive control law, the WMR parameters can change online within bounded domains. One assumes that the control torques $\tau$ applied to the DC motor-mounted gearboxes are statically related to the voltage input by a constant, so the electrical dynamics of the motors can be included in the general disturbance $\tau_d$, such that $\|\tau_d\|\le 3$ N·m. The WMR matrices and the desired velocities of the reference robot are defined as

$$C(q,\dot q)\dot q = ml^2\dot\theta^2[\cos\theta, \sin\theta, 0]^T, \quad S(q) = \begin{bmatrix}\cos\theta & 0\\ \sin\theta & 0\\ 0 & 1\end{bmatrix}, \quad \bar M = \begin{bmatrix}m & 0\\ 0 & I_1\end{bmatrix}, \quad \bar B = \frac{1}{r_1}\begin{bmatrix}1 & 1\\ b_1 & -b_1\end{bmatrix},$$

$$\vartheta_{rd} = \begin{bmatrix}\sqrt{\cos^2t + 4\cos^2(2t)} \\ \dfrac{2\sin t\cos(2t) - 4\sin(2t)\cos t}{\cos^2t + 4\cos^2(2t)}\end{bmatrix}$$

Then $f(x)$, $g(x)$ and $k(x)$ are obtained by substituting these parameters into the formulations in Definitions 1 and 3. The smooth desired eight-shaped trajectory $q_d = [x_d, y_d, \theta_d]^T$ is generated by $\vartheta_{rd}$ and satisfies the constraint in Definition 2. The weight vector of the critic NN is defined as $\hat W = [\hat w_1, \hat w_2, \dots, \hat w_{15}]^T$, with initial values of zero. The adaptive gains are selected as $\alpha_1 = 100$ and $\alpha_2 = 0.01$. The activation function vector of the critic NN, with 15 elements, is chosen as $\Phi(e) = [e_x^2, e_xe_y, e_xe_\theta, e_xe_v, e_xe_\omega, e_y^2, e_ye_\theta, e_ye_v, e_ye_\omega, e_\theta^2, e_\theta e_v, e_\theta e_\omega, e_v^2, e_ve_\omega, e_\omega^2]^T$. One selects $R = I\in\mathbb{R}^{4\times 4}$, $Q = I\in\mathbb{R}^{5\times 5}$ and $\rho = 5$. The PE condition is ensured by adding the noise $e^{-0.005t}\,\mathrm{rand}(t)$ to the control inputs and disturbance, where $\mathrm{rand}(t)$ generates random signals in the range $[-1, 1]$. The desired position vector of the virtual robot is initialized at $q_d(0) = [x_d, y_d, \theta_d]^T = [0, 0, \pi/6]^T$. The initial position and velocities of the WMR are $q(0) = [0.5\ \mathrm{m}, 0.5\ \mathrm{m}, 0]^T$ and $\vartheta(0) = [0, 0]^T$, respectively. The parameter $T$ is chosen as 0.01 s and $T_{stop} = 800$ s.

Figure 2. The block diagram of the algorithm.

Figure 3. Convergence of parameters of the critic neural networks (NNs) in algorithms using one and three NNs.
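The 15-element activation vector chosen for the WMR critic consists of all quadratic monomials of the five tracking errors, which matches the count $(n+1)n/2$ for $n = 5$. The sketch below generates such a basis and its Jacobian $\Phi_e$ for any error dimension; the function names and the sample error vector are illustrative assumptions, and the element ordering follows the upper-triangular order listed in the text.

```python
import itertools
import numpy as np

def quadratic_basis(e):
    """All monomials e_i * e_j with i <= j, in upper-triangular order."""
    e = np.asarray(e, dtype=float)
    pairs = itertools.combinations_with_replacement(range(e.size), 2)
    return np.array([e[i] * e[j] for i, j in pairs])

def quadratic_basis_jacobian(e):
    """Jacobian of quadratic_basis with respect to e, shape (N, n)."""
    e = np.asarray(e, dtype=float)
    pairs = list(itertools.combinations_with_replacement(range(e.size), 2))
    J = np.zeros((len(pairs), e.size))
    for row, (i, j) in enumerate(pairs):
        J[row, i] += e[j]   # derivative of e_i*e_j with respect to e_i
        J[row, j] += e[i]   # derivative with respect to e_j (adds up to 2*e_i if i == j)
    return J

# Errors ordered as [e_x, e_y, e_theta, e_v, e_omega]; values are illustrative.
e = np.array([0.5, -0.2, 0.1, 0.3, -0.4])
phi = quadratic_basis(e)
Phi_e = quadratic_basis_jacobian(e)
```

With three networks of this size the number of weights would triple, which is the storage argument made for the single-NN structure.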

The convergence of the critic parameters is shown in Figure 4. It can be seen that almost all parameters converge after 300 s. The PE noise can be removed at any time after that; here it is removed after 500 s. The evolution of the posture tracking errors during the simulation is presented in Figure 5. Although affected by input disturbances, the errors still converge close to zero. Posture tracking of the WMR versus the reference robot under the designed robust direct adaptive controller is shown in Figure 6. The evolution of the tracking errors between the virtual control velocities and the WMR velocities during the simulation is shown in Figures 7 and 8, while Figure 9 presents the actual and virtual velocities.
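The reference posture used in this simulation can be reproduced, under stated assumptions, by integrating $\dot q_d = S(q_d)\vartheta_{rd}$ with the desired velocities and the initial posture $[0, 0, \pi/6]^T$ given above. The sketch below uses a classical fourth-order Runge–Kutta step, which is a choice made here and not stated in the paper; the function names and step count are likewise illustrative.

```python
import numpy as np

def reference_velocity(t):
    """Desired linear and angular velocity of the reference robot."""
    s = np.cos(t) ** 2 + 4.0 * np.cos(2 * t) ** 2
    v = np.sqrt(s)
    omega = (2 * np.sin(t) * np.cos(2 * t) - 4 * np.sin(2 * t) * np.cos(t)) / s
    return np.array([v, omega])

def reference_rhs(qd, t):
    """Reference kinematics qd_dot = S(qd) @ vartheta_rd(t)."""
    v, omega = reference_velocity(t)
    return np.array([v * np.cos(qd[2]), v * np.sin(qd[2]), omega])

def rk4_step(qd, t, dt):
    k1 = reference_rhs(qd, t)
    k2 = reference_rhs(qd + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = reference_rhs(qd + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = reference_rhs(qd + dt * k3, t + dt)
    return qd + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

def generate_reference(qd0, t_end, n_steps):
    """Integrate the reference posture from t = 0 to t_end."""
    dt = t_end / n_steps
    traj = np.zeros((n_steps + 1, 3))
    traj[0] = qd0
    for i in range(n_steps):
        traj[i + 1] = rk4_step(traj[i], i * dt, dt)
    return traj

trajectory = generate_reference(np.array([0.0, 0.0, np.pi / 6]), 2 * np.pi, 2000)
```

Over one period of $2\pi$ seconds the integrated posture returns to its starting point, tracing a closed eight-shaped path.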

Figure 4. Convergence of parameters of the critic neural network (NN).

Figure 5. Evolution of the posture tracking errors during simulation.

Figure 6. Posture of wheeled mobile robot (WMR) with input disturbances.

Figure 7. Evolution of the linear velocity tracking error.

Conclusions

The paper presents a new method for designing an integrated kinematic and dynamic intelligent tracking control algorithm for a WMR. The designed algorithm is a synchronous policy iteration using the actor critic structure with a single NN. The closed-loop dynamic tracking errors and the critic parameters are proved to be UUB during online learning. The optimal value function, the robust direct adaptive control input and the worst-case disturbance converge to approximately optimal values.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

References

Abu-Khalaf M and Lewis FL (2005) Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 41(5): 779–791.

Chen BS, Uang HJ and Tseng CS (1998) Robust tracking enhancement of robot systems including motor dynamics: a fuzzy-based dynamic game approach. IEEE Transactions on Fuzzy Systems 6(4): 538–552.

Chen H, Ma M, Wang H, et al. (2009) Moving horizon H∞ tracking control of wheeled mobile robots with actuator saturation. IEEE Transactions on Control Systems Technology 17(2): 449–457.

Chwa D (2010) Tracking control of differential-drive wheeled mobile robots using a backstepping-like feedback linearization. IEEE Transactions on Systems, Man, and Cybernetics—Part A: Systems and Humans 40(6): 1285–1295.

Dierks T and Jagannathan S (2010) Optimal control of affine nonlinear continuous-time systems using an online Hamilton–Jacobi–Isaacs formulation. In: 49th IEEE Proceedings of the CDC2010, pp. 3048–3053.

Fierro R and Lewis FL (1998) Control of a nonholonomic mobile robot using neural networks. IEEE Transactions on Neural Networks 4: 589–600.

Finlayson BA (1990) The Method of Weighted Residuals and Variational Principles. New York: Academic Press.

Hornik K, Stinchcombe M and White H (1990) Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks 3: 551–560.

Khoshnam S, Alireza MS and Ahmadrez T (2011) Adaptive feedback linearizing control of nonholonomic wheeled mobile robots in presence of parametric and nonparametric uncertainties. Robotics and Computer-Integrated Manufacturing 27(1): 194–204.

Lewis FL, Jagannathan S and Yesildirek A (1999) Neural Network Control of Robot Manipulators and Nonlinear Systems. London: Taylor & Francis.

Lin WS and Yang PC (2008) Adaptive critic motion control design of autonomous wheeled mobile robot by dual heuristic programming. Automatica 44: 2716–2723.

Luy NT (2012) Reinforcement learning-based optimal tracking control for wheeled mobile robot. In: Proceeding of the IEEE International Conference on Cyber Technology in Automation, Control, and Intelligent Systems, pp. 371–376.

Luy NT, Thanh ND, Thanh NT, et al. (2010) Robust reinforcement learning-based tracking control for wheeled mobile robot. In: IEEE Proceedings of the ICCAE2010, Vol. 1, pp. 171–176.

Marvin KB, Simon GF and Liberato C (2009) Dual adaptive dynamic control of mobile robots using neural networks. IEEE Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics 39(1): 129–141.

Miyasato Y (2008) Adaptive H∞ control of nonholonomic mobile robot based on inverse optimality. In: Proceedings of the American Control Conference, Seattle, WA, pp. 3524–3529.

Mohareri O, Dhaouadi R and Rad AB (2012) Indirect adaptive tracking control of a nonholonomic mobile robot via neural networks. Neurocomputing 88: 54–66.

Vamvoudakis KG and Lewis FL (2010) Online actor critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica 46: 878–888.

Vamvoudakis KG and Lewis FL (2011) Multi-player non-zero-sum games: online adaptive learning solution of coupled Hamilton–Jacobi equations. Automatica 47: 1556–1569.

Vamvoudakis KG, Vrabie D and Lewis FL (2011) Online learning algorithm for zero-sum games with integral reinforcement learning. Journal of Artificial Intelligence and Soft Computing Research 1(4): 315–332.

Van Der Shaft AJ (1992) L2-gain analysis of nonlinear systems and nonlinear state feedback H∞ control. IEEE Transactions on Automatic Control 37(6): 770–784.

Vrabie D, Pastravanu O, Lewis FL, et al. (2009) Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 45(2): 477–484.

Zenon H and Marcin S (2011) Discrete neural dynamic programming in wheeled mobile robot control. Communications in Nonlinear Science & Numerical Simulation 16: 2355–2362.

[Figure 8. Evolution of the rotation velocity tracking error. Horizontal axis: Time (s); vertical axis: e_ω.]

[Figure 9. Actual and virtual velocities with input disturbance. Horizontal axis: Time (s); curves: ω_d, ω, v_d, v.]


Appendix A: proof

Proof of Theorem 1

Consider the following Lyapunov function candidate:

$$L(t) = \frac{1}{2}\int_{t}^{t+T} \alpha_2\, e^{T} e \, d\tau + \frac{1}{2}\tilde{W}^{T}\tilde{W} \tag{A.1}$$

whose derivative with respect to time is given by

$$\dot{L}(t) = \int_{t}^{t+T} \alpha_2\, e^{T}\dot{e} \, d\tau + \tilde{W}^{T}\dot{\tilde{W}} \tag{A.2}$$

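First, consider the form of the tuning law (27) without $W_{RB}$. The negativity condition in (27) can be rewritten as

$$\frac{1}{2}\left(e_{t+T}^{T}e_{t+T} - e_{t}^{T}e_{t}\right) = \int_{t}^{t+T} e^{T}\dot{e}\, d\tau = \int_{t}^{t+T} e^{T}\left(f(x) + g\hat{u} + k\hat{d}\right) d\tau \le 0 \tag{A.3}$$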
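The first equality in (A.3) is simply the fundamental theorem of calculus applied to $\frac{1}{2}e^{T}e$. The following minimal sketch checks it numerically for a hypothetical decaying error trajectory; the trajectory `e`, the window `t`, `T` and the step count are illustrative assumptions and are not taken from the paper's simulations.

```python
import math

def e(tau):
    # Hypothetical two-component tracking error (assumption, for illustration only).
    return (math.exp(-tau), 0.5 * math.exp(-2.0 * tau))

def e_dot(tau):
    # Analytic time derivative of the trajectory above.
    return (-math.exp(-tau), -math.exp(-2.0 * tau))

def dot(a, b):
    # Euclidean inner product of two equally sized tuples.
    return sum(x * y for x, y in zip(a, b))

def integral_e_edot(t, T, steps=20000):
    # Composite trapezoidal rule for the integral of e^T e_dot over [t, t + T].
    h = T / steps
    total = 0.5 * (dot(e(t), e_dot(t)) + dot(e(t + T), e_dot(t + T)))
    for i in range(1, steps):
        tau = t + i * h
        total += dot(e(tau), e_dot(tau))
    return total * h

t, T = 0.3, 0.5
lhs = 0.5 * (dot(e(t + T), e(t + T)) - dot(e(t), e(t)))  # left-hand side of (A.3)
rhs = integral_e_edot(t, T)                               # middle term of (A.3)
print(lhs, rhs)
```

For this decaying trajectory both quantities are negative, which is the situation the negativity condition describes: the squared error norm is smaller at the end of the window than at its start.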

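From (21),

$$Q(e) = -W^{T}\Phi_e f(x) + \frac{1}{4}W^{T}\Phi_e G \Phi_e^{T} W - \frac{1}{4}W^{T}\Phi_e K \Phi_e^{T} W - \varepsilon_{HJI} \tag{A.4}$$

Substituting $Q(e)$ into (26), one obtains

$$\hat{H}(e, \tilde{W}) = -\tilde{W}^{T}\Phi_e f(x) + \frac{1}{2}\tilde{W}^{T}\Phi_e (G - K)\Phi_e^{T} W - \frac{1}{4}\tilde{W}^{T}\Phi_e (G - K)\Phi_e^{T}\tilde{W} - \varepsilon_{HJI} \tag{A.5}$$

where $\tilde{W} = W - \hat{W}$ is the NN weight estimation error.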
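The step from (A.4) to (A.5) is pure algebra in $W$, $\hat{W}$ and $\tilde{W} = W - \hat{W}$. The sketch below checks it on random data under one stated assumption: that the approximate Hamiltonian (26) has the form $\hat{H} = Q(e) + \hat{W}^{T}\Phi_e f(x) - \frac{1}{4}\hat{W}^{T}\Phi_e (G-K)\Phi_e^{T}\hat{W}$, which is not restated in this appendix. The vector `phi_f` stands for $\Phi_e f(x)$ and the symmetric matrix `M` for $\Phi_e (G-K)\Phi_e^{T}$; all numerical values are hypothetical.

```python
import random

def dot(a, b):
    # Inner product a^T b.
    return sum(x * y for x, y in zip(a, b))

def quad(a, M, b):
    # Bilinear form a^T M b.
    return sum(a[i] * M[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))

random.seed(0)
N = 4  # number of NN weights (illustrative)

phi_f = [random.uniform(-1, 1) for _ in range(N)]               # stands for Phi_e f(x)
R = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(N)]
M = [[R[i][j] + R[j][i] for j in range(N)] for i in range(N)]   # symmetric, stands for Phi_e (G-K) Phi_e^T
W = [random.uniform(-1, 1) for _ in range(N)]                   # ideal weights
W_hat = [random.uniform(-1, 1) for _ in range(N)]               # estimated weights
W_tilde = [w - wh for w, wh in zip(W, W_hat)]                   # estimation error
eps_hji = random.uniform(-0.1, 0.1)                             # residual epsilon_HJI

# (A.4) with G and K merged into M.
Q = -dot(W, phi_f) + 0.25 * quad(W, M, W) - eps_hji

# Assumed form of (26), evaluated directly with the estimated weights.
H_direct = Q + dot(W_hat, phi_f) - 0.25 * quad(W_hat, M, W_hat)

# Right-hand side of (A.5).
H_a5 = (-dot(W_tilde, phi_f) + 0.5 * quad(W_tilde, M, W)
        - 0.25 * quad(W_tilde, M, W_tilde) - eps_hji)

print(H_direct, H_a5)
```

The two values agree only because `M` is symmetric, which holds whenever $G$ and $K$ are symmetric.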

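Rewrite $\sigma$ as

$$\begin{aligned}
\sigma &= \int_{t}^{t+T} \Phi_e\left(f(x) + g\hat{u} + k\hat{d}\right) d\tau \\
&= \int_{t}^{t+T} \left(\Phi_e f(x) - \frac{1}{2}\Phi_e (G - K)\Phi_e^{T}\hat{W}\right) d\tau \\
&= \int_{t}^{t+T} \left(\Phi_e f(x) - \frac{1}{2}\Phi_e (G - K)\Phi_e^{T}\left(W - \tilde{W}\right)\right) d\tau \\
&= \int_{t}^{t+T} \left(\Phi_e\left(f(x) + gu^{*} + kd^{*} + \frac{1}{2}(G - K)\varepsilon_e\right) + \frac{1}{2}\Phi_e (G - K)\Phi_e^{T}\tilde{W}\right) d\tau
\end{aligned} \tag{A.6}$$

Using (A.5) and (A.6) with $\dot{\tilde{W}} = -\dot{\hat{W}}$ and $\dot{\tilde{W}}_1 = \dot{\tilde{W}}$, the error dynamics of (27) are written as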

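$$\begin{aligned}
\dot{\tilde{W}}_1 = -\frac{\alpha_1}{m^2} &\int_{t}^{t+T} \left(\Phi_e\left(f(x) + gu^{*} + kd^{*} + \frac{1}{2}(G - K)\varepsilon_e\right) + \frac{1}{2}\Phi_e (G - K)\Phi_e^{T}\tilde{W}\right) d\tau \\
\times &\int_{t}^{t+T} \left(\tilde{W}^{T}\Phi_e\left(f(x) + gu^{*} + kd^{*} + \frac{1}{2}(G - K)\varepsilon_e\right) + \frac{1}{4}\tilde{W}^{T}\Phi_e (G - K)\Phi_e^{T}\tilde{W} + \varepsilon_{HJI}\right) d\tau
\end{aligned} \tag{A.7}$$

where $m = \sigma^{T}\sigma + 1$. Substituting (A.7) and the dynamics (13) under the laws (24) and (25) into (A.2) gives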

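$$\begin{aligned}
\dot{L} ={}& \int_{t}^{t+T} \alpha_2\, e^{T}\left(f(x) + g\hat{u} + k\hat{d}\right) d\tau \\
&- \frac{\alpha_1}{m^2}\left(\int_{t}^{t+T} \tilde{W}^{T}\Phi_e\left(f(x) + gu^{*} + kd^{*} + \frac{1}{2}(G - K)\varepsilon_e\right) d\tau\right)^{2} \\
&- \frac{\alpha_1}{8m^2}\left(\int_{t}^{t+T} \tilde{W}^{T}\Phi_e (G - K)\Phi_e^{T}\tilde{W}\, d\tau\right)^{2} \\
&- \frac{3\alpha_1}{4m^2}\int_{t}^{t+T} \tilde{W}^{T}\Phi_e\left(f(x) + gu^{*} + kd^{*} + \frac{1}{2}(G - K)\varepsilon_e\right) d\tau \times \int_{t}^{t+T} \tilde{W}^{T}\Phi_e (G - K)\Phi_e^{T}\tilde{W}\, d\tau \\
&- \frac{\alpha_1}{m^2}\int_{t}^{t+T} \tilde{W}^{T}\Phi_e\left(f(x) + gu^{*} + kd^{*} + \frac{1}{2}(G - K)\varepsilon_e\right) d\tau \int_{t}^{t+T} \varepsilon_{HJI}\, d\tau \\
&- \frac{\alpha_1}{2m^2}\int_{t}^{t+T} \tilde{W}^{T}\Phi_e (G - K)\Phi_e^{T}\tilde{W}\, d\tau \int_{t}^{t+T} \varepsilon_{HJI}\, d\tau
\end{aligned} \tag{A.8}$$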
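The coefficients $1$, $\frac{1}{8}$, $\frac{3}{4}$, $1$ and $\frac{1}{2}$ in (A.8) come from multiplying the two integrals of (A.7) after left-multiplying by $\tilde{W}^{T}$. Writing $A$ for the integral of $\tilde{W}^{T}\Phi_e\left(f(x) + gu^{*} + kd^{*} + \frac{1}{2}(G-K)\varepsilon_e\right)$, $B$ for the integral of $\tilde{W}^{T}\Phi_e(G-K)\Phi_e^{T}\tilde{W}$ and $C$ for the integral of $\varepsilon_{HJI}$ (shorthand introduced here for illustration), the product is $(A + \frac{1}{2}B)(A + \frac{1}{4}B + C)$. The sketch below checks the expansion exactly with rational arithmetic; the sampled values are arbitrary.

```python
from fractions import Fraction
import random

def product_form(A, B, C):
    # (W~^T sigma) times the second integral of (A.7), in the A, B, C shorthand.
    return (A + B / 2) * (A + B / 4 + C)

def expanded_form(A, B, C):
    # Term-by-term expansion whose coefficients appear in (A.8).
    return A * A + B * B / 8 + Fraction(3, 4) * A * B + A * C + B * C / 2

def check_expansion(trials=200, seed=1):
    # Compare both forms on random rational triples; exact, no rounding error.
    rng = random.Random(seed)
    for _ in range(trials):
        A, B, C = (Fraction(rng.randint(-50, 50), rng.randint(1, 20)) for _ in range(3))
        if product_form(A, B, C) != expanded_form(A, B, C):
            return False
    return True

print(check_expansion())
```

Multiplying the expanded form by $-\alpha_1/m^2$ reproduces the last five terms of (A.8).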

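It is straightforward to verify from (A.3) that there exists a positive constant $\lambda_0$ such that

$$\int_{t}^{t+T} \alpha_2\, e^{T}\left(f(x) + g\hat{u} + k\hat{d}\right) d\tau \le -\int_{t}^{t+T} \alpha_2 \lambda_0 \|e\|\, d\tau \tag{A.9}$$

Substituting $\varepsilon_{HJI}$ from (22) and (A.9) into (A.8), and completing the square with respect to $\int_{t}^{t+T} \tilde{W}^{T}\Phi_e (G - K)\Phi_e^{T}\tilde{W}\, d\tau$ and $\int_{t}^{t+T} \tilde{W}^{T}\Phi_e\left(f(x) + gu^{*} + kd^{*} + \frac{1}{2}(G - K)\varepsilon_e\right) d\tau$, one obtains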

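$$\begin{aligned}
\dot{L} \le{}& -\alpha_2\lambda_0\int_{t}^{t+T}\|e\|\, d\tau - \frac{\alpha_1}{m^2}\left(\frac{A}{2} + C\right)^{2} - \frac{\alpha_1}{m^2}\left(\frac{B}{4} + C\right)^{2} - \frac{3\alpha_1}{m^2}\left(\frac{B}{8} + A\right)^{2} \\
&+ \frac{9\alpha_1}{4m^2}A^{2} - \frac{\alpha_1}{64m^2}B^{2} + \frac{2\alpha_1}{m^2}C^{2} \\
\le{}& -\alpha_2\lambda_0\int_{t}^{t+T}\|e\|\, d\tau + \frac{9\alpha_1}{4m^2}A^{2} - \frac{\alpha_1}{64m^2}B^{2} + \frac{2\alpha_1}{m^2}C^{2}
\end{aligned} \tag{A.10}$$

where, consistent with the terms of (A.8), $A = \int_{t}^{t+T} \tilde{W}^{T}\Phi_e\left(f(x) + gu^{*} + kd^{*} + \frac{1}{2}(G - K)\varepsilon_e\right) d\tau$, $B = \int_{t}^{t+T} \tilde{W}^{T}\Phi_e (G - K)\Phi_e^{T}\tilde{W}\, d\tau$ and $C = \int_{t}^{t+T} \varepsilon_{HJI}\, d\tau$.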
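The completing-the-square step behind (A.10) can be checked independently of the control problem. With $A$, $B$, $C$ denoting the three integrals that appear in (A.8) (the integrals of $\tilde{W}^{T}\Phi_e(f(x) + gu^{*} + kd^{*} + \frac{1}{2}(G-K)\varepsilon_e)$, of $\tilde{W}^{T}\Phi_e(G-K)\Phi_e^{T}\tilde{W}$ and of $\varepsilon_{HJI}$), the claim is that $A^{2} + \frac{1}{8}B^{2} + \frac{3}{4}AB + AC + \frac{1}{2}BC$ equals $(\frac{A}{2} + C)^{2} + (\frac{B}{4} + C)^{2} + 3(\frac{B}{8} + A)^{2} - \frac{9}{4}A^{2} + \frac{1}{64}B^{2} - 2C^{2}$. The following sketch, written under that reading of the garbled source, verifies the identity exactly on arbitrary rational values.

```python
from fractions import Fraction
import random

def cross_terms(A, B, C):
    # Bracket multiplying -alpha_1/m^2 in (A.8).
    return A * A + B * B / 8 + Fraction(3, 4) * A * B + A * C + B * C / 2

def completed_square(A, B, C):
    # Three perfect squares minus the remainder that reappears with a plus sign in (A.10).
    squares = (A / 2 + C) ** 2 + (B / 4 + C) ** 2 + 3 * (B / 8 + A) ** 2
    remainder = Fraction(9, 4) * A * A - B * B / 64 + 2 * C * C
    return squares - remainder

def check_identity(trials=200, seed=2):
    # Exact comparison on random rational triples.
    rng = random.Random(seed)
    for _ in range(trials):
        A, B, C = (Fraction(rng.randint(-50, 50), rng.randint(1, 20)) for _ in range(3))
        if cross_terms(A, B, C) != completed_square(A, B, C):
            return False
    return True

print(check_identity())
```

Because the three squares are non-negative, dropping them after multiplication by $-\alpha_1/m^2$ can only increase the right-hand side, which gives the second inequality in (A.10).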
