Reinforcement learning-based intelligent tracking control for wheeled mobile robot
Nguyen Tan Luy, Nguyen Thien Thanh and Hoang Minh Tri
Transactions of the Institute of Measurement and Control, published online 10 March 2014
DOI: 10.1177/0142331213509828
The online version of this article can be found at: http://tim.sagepub.com/content/early/2014/03/10/0142331213509828
Published by:
http://www.sagepublications.com
On behalf of:
The Institute of Measurement and Control
Transactions of the Institute of Measurement and Control 1–10
© The Author(s) 2014
DOI: 10.1177/0142331213509828
Nguyen Tan Luy^1, Nguyen Thien Thanh^2 and Hoang Minh Tri^2
Abstract
This paper proposes a new method to design a reinforcement learning-based integrated kinematic and dynamic tracking control algorithm for a non-holonomic wheeled mobile robot without knowledge of the system's drift tracking dynamics. The actor critic structure in the control scheme uses only one neural network to reduce computational cost and storage resources. A novel tuning law for the single neural network is designed to learn an online solution of a tracking Hamilton–Jacobi–Isaacs (HJI) equation. The HJI solution is used to approximate an $H_\infty$ optimal tracking performance index function and an intelligent tracking control law in the case of the worst disturbance. The laws guarantee closed-loop stability in real time. The convergence and stability of the overall system are proved by Lyapunov techniques. The simulation results on a non-linear system and a wheeled mobile robot verify the effectiveness of the proposed controller.
Keywords
Actor critic, Hamilton–Jacobi–Isaacs equation, neural network, wheeled mobile robot
Introduction
An important motion control problem for wheeled mobile robots (WMRs) is trajectory tracking. This problem has been extensively studied in the past few decades. Generally, a variety of control algorithms for the trajectory tracking problem have been proposed in the form of adaptive control (Fierro and Lewis, 1998; Marvin et al., 2009; Mohareri et al., 2012), where back-stepping techniques are used. The kinematic controllers are designed using the available models, and the dynamic controllers are designed based on neural networks (NNs). They are considered indirect adaptive controllers. Besides, they do not minimize any long-term performance function and hence are not optimal. $H_\infty$ adaptive control for a WMR based on inverse optimality is proposed in Miyasato (2008), but it is an offline control scheme. A specific characteristic of WMR models is that they can be presented as a non-linear system in strict-feedback form but, to the best knowledge of the authors, tracking control methods for a WMR using this form have so far been considered only in adaptive back-stepping (Chwa, 2010) or adaptive feedback linearization schemes (Khoshnam et al., 2011), without any optimality.
In the other direction, thanks to the online adaptive learning abilities of reinforcement learning (RL) methods in optimal control, tracking control methods for WMRs have been studied. The adaptive critic structures in RL are exploited to learn discrete controllers (Lin and Yang, 2008; Zenon and Marcin, 2011) or a continuous controller without disturbance using the learned solution of the Hamilton–Jacobi–Bellman (HJB) equation (Luy, 2012). These controllers not only overcome drawbacks of other methods, such as the need for a fuzzy domain expert or an existing controller to generate training samples for NNs, but also optimize utility functions, in contrast to the NN-based adaptive controllers, which consider only the tracking error at the current time instant. However, these methods require a known explicit model of the WMR and ignore the disturbance, so they are not a type of robust adaptive control method.
To control a non-linear system, i.e a WMR system with optimality related to disturbances using RL, the solutions of Hamilton–Jacobi–Isaacs (HJI) in the H‘ optimal control problem must be learned (Dierks and Jagannathan, 2010) The integral RL-based direct adaptive control algorithm for a class of general non-linear system has been studied in Vamvoudakis et al (2011) to solve the HJI equation The most favourable part of this algorithm is that NNs can be trained synchronously to approximate optimal control input and worst-case disturbance without knowledge of the system drift dynamics terms However, it requires three NNs in the same structure – one for the critic and the others for actors The number of neurons in the hidden layers should be at least
1
Industrial University of Ho Chi Minh City, Ho Chi Minh City, Vietnam
2
Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam Corresponding author:
Nguyen Tan Luy, Division of Automation Electronics, 305A BaHom District, Kienthanh Building, Ho Chi Minh City, Ho Chi Minh 70000, Vietnam.
Email: luynguyentan@yahoo.com
Trang 3(n + 1)n/2, where n is number of state variables In practical
applications, e.g robotics, the number of state variables
mea-sured from sensors for feedback may be relatively large With
three NNs, the number of NNs weights and the activation
functions representing the elements in combination of the states
will significantly increase If applied directly, the algorithm to a
non-linear system may lead to the computational complexity
and resource consumption In contrast, a method using a single
online approximator (SOLA) in Dierks and Jagannathan
(2010) to solve the HJI equation can reduce the number of
NNs but, unfortunately, it is a type of model-based RL
Motivated by the aforementioned problems, the paper makes three main contributions. The first is the derivation of tracking dynamics from a non-linear strict-feedback model of the WMR, the purpose of which is to design an integrated kinematic and dynamic RL-based intelligent controller, i.e. an integrated kinematic and dynamic robust direct adaptive tracking controller with optimality, without explicit knowledge of the system's drift dynamics. The second is that the actor critic structure in the RL scheme uses only one NN, for the critic. The last contribution is the tuning law for the critic NN, so that solutions of the tracking HJI equation are learned, and the optimal values of the tracking performance index function, the robust direct adaptive control law and the worst-case disturbance law are approximated without accessing the system's drift dynamics. By Lyapunov techniques, the closed-loop system state and the critic NN error are proved to be uniformly ultimately bounded (UUB), and the system parameters are shown to converge asymptotically to the optimal target values.
The paper is organized as follows. The next section provides the theoretical background of the WMR to establish the non-linear WMR system in strict-feedback form, from which the new tracking dynamics are derived. Then we design the integrated kinematic and dynamic robust direct adaptive tracking control scheme with optimality, along with the tuning law for the critic NN, and give a proof of stability and convergence. Finally, simulation results on the WMR verify the effectiveness of the proposed algorithm, and conclusions are drawn.
Strict-feedback kinematic and dynamic model
A WMR with differentially driven wheels mounted on a driving axle can move and rotate on the horizontal plane thanks to two independent actuators. Torque from the actuators is transmitted to the left and right wheels to drive the robot. The mass of the WMR, comprising the mass of the platform without the wheels and the mass of the wheels, is concentrated at a central point. The distance between the driven wheels is $b_1$. The radius of each wheel is $r_1$. The distance from the centre point to the driving axle is $l$. Without loss of generality, it can be assumed that $l = 0$. The WMR is considered a mechanical system with $n$ generalized configuration variables $q$ subject to $m$ constraints ($m < n$), represented by the following equation (Khoshnam et al., 2011):
$$H_{k,j}(q, \dot q) = \sum_{i=1}^{n} h_{k,ji}(q, \dot q)\,\dot q_i = 0, \quad j = 1, \ldots, m \qquad (1)$$
where the numbers of holonomic and non-holonomic constraints are $k$ and $m - k$, respectively. The constraints are independent of time and can be written as $A_k(q)\dot q = 0$, where $A_k \in \mathbb{R}^{m \times n}$ is a full-rank matrix. Assume that $S(q) \in \mathbb{R}^{n \times (n-m)}$ is also a full-rank matrix formed from a set of smooth and linearly independent vector fields in the null space of $A_k$, such that $A_k(q)S(q) = 0$. Let $\vartheta(t) \in \mathbb{R}^{n-m}$ be the velocity vector, which can be seen as the pseudo-control vector and is important for forming a strict-feedback non-linear system afterwards. The kinematic equation of WMR motion can be written as
$$\dot q = S(q)\vartheta \qquad (2)$$
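To derive a dynamic equation of the WMR, the Lagrange formalism is used as follows:
$$\frac{d}{dt}\left(\frac{\partial L}{\partial \dot q}\right) - \frac{\partial L}{\partial q} = F_T \qquad (3)$$
where $F_T$ is the vector of the generalized forces. The WMR moves on a plane, so the Lagrangian $L$ only includes kinetic energy: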
$$L = \frac{1}{2}\sum_{i=1}^{n_i}\left(v_i^T m_i v_i + \omega_i^T I_i \omega_i\right) \qquad (4)$$
where $v_i$, $\omega_i$, $m_i$ and $I_i$ are elements of the linear velocity, the rotation velocity, the mass and the moment of inertia, respectively. As a result, the dynamic model of the WMR is expressed as
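$$M(q)\ddot q + C(q, \dot q)\dot q + B(q)F(\dot q) + B(q)\tau_d = B(q)\tau - A_k^T(q)\lambda \qquad (5)$$
where $M(q) \in \mathbb{R}^{n \times n}$ is the symmetric positive definite inertia matrix, $C(q, \dot q) \in \mathbb{R}^{n \times n}$ is the centripetal and Coriolis matrix, $F(\dot q) \in \mathbb{R}^{n \times 1}$ is the surface friction and gravitational vector, $\tau_d \in \mathbb{R}^{(n-m) \times 1}$ denotes the bounded unknown disturbances including unstructured unmodelled dynamics, $B(q) \in \mathbb{R}^{n \times (n-m)}$ is the input transformation matrix, $\tau \in \mathbb{R}^{(n-m) \times 1}$ is the input torque vector and $\lambda \in \mathbb{R}^{m \times 1}$ is the vector of constraint forces. Taking the time derivative of the kinematic model (2), one obtains
$$\ddot q = \dot S(q)\vartheta + S(q)\dot\vartheta \qquad (6)$$
Substituting (2) and (6) into (5), multiplying both sides of the result by $S^T(q)$ and noting that $A_k(q)S(q) = 0$, one obtains
$$\bar M(q)\dot\vartheta(t) + \bar C(q, \dot q)\vartheta(t) + \bar F(\dot q) + \bar\tau_d = \bar B(q)\tau \qquad (7)$$
where $\bar M(q) = S^T M S$, $\bar C(q, \dot q) = S^T M \dot S + S^T C S$, $\bar B(q) = S^T B$, $\bar F(\dot q) = S^T M \dot S\vartheta + \bar B F$ and $\bar\tau_d = \bar B\tau_d$.
Definition 1. Let $f_q(q) = 0_{n \times 1}$, $g_q(q) = S(q) \in \mathbb{R}^{n \times (n-m)}$, $f_\vartheta(q, \vartheta) = -\bar M^{-1}(q)\left(\bar C(\dot q, q)\vartheta + \bar F(\dot q)\right) \in \mathbb{R}^{(n-m) \times 1}$, $g_\vartheta(q, \vartheta) = \bar M^{-1}(q)\bar B(q) \in \mathbb{R}^{(n-m) \times (n-m)}$ and $k_\vartheta(q, \vartheta) = -\bar M^{-1}(q) \in \mathbb{R}^{(n-m) \times (n-m)}$. The state space equation of the WMR, represented as a non-linear system in strict-feedback form, is obtained from the kinematic and dynamic Equations (2) and (7):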
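$$\begin{cases} \dot q = f_q(q) + g_q(q)\vartheta \\ \dot\vartheta = f_\vartheta(q, \vartheta) + g_\vartheta(q, \vartheta)\tau + k_\vartheta(q, \vartheta)\tau_d \end{cases} \qquad (8)$$
The system (8) is assumed to be controllable and drift free, with $(q, \vartheta) = 0$ a unique equilibrium point on a compact set $X \subset \mathbb{R}^{2n-m}$. Some important properties are as follows.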
Property 1. $\bar M(q)$ is a bounded, symmetric and positive definite matrix such that $m_{\min} \le \|\bar M(q)\| \le m_{\max}$, where $m_{\min}$ and $m_{\max}$ are positive scalar constants.
Property 2. $\bar C(q)$ is bounded such that $c_{\min} \le \|\bar C(q)\| \le c_{\max}$, where $c_{\min}$ and $c_{\max}$ are positive scalar constants.
Property 3. The disturbance $\tau_d$ is bounded such that $\|\tau_d\| \le \tau_{d\max}$, where $\tau_{d\max}$ is a positive scalar constant.
Property 4. $f_\vartheta(q, \vartheta)$ is the system uncertainty dynamics term and $\|f_\vartheta(q, \vartheta)\| \le c_{\max} m_{\min}^{-1}\|\vartheta\|$.
Property 5. $g_q(q)$ is bounded such that $g_{\min} \le \|g_q(q)\| \le g_{\max}$, where $g_{\min}$ and $g_{\max}$ are positive scalar constants.
Property 6. $g_\vartheta(q, \vartheta)$ is bounded such that $m_{\max}^{-1}\|\bar B\| \le \|g_\vartheta(q, \vartheta)\| \le m_{\min}^{-1}\|\bar B\|$, where $\bar B(q)$, a constant non-singular matrix, depends on the geometric parameters of the WMR, i.e. the wheel radius $r_1$ and the robot frame width $b_1$ (Khoshnam et al., 2011); according to Property 1, $g_\vartheta(q, \vartheta) \ne 0$.
Property 7. $k_\vartheta(q, \vartheta)$ is bounded such that $m_{\max}^{-1} \le \|k_\vartheta(q, \vartheta)\| \le m_{\min}^{-1}$ and, according to Property 1, $k_\vartheta(q, \vartheta) \ne 0$.
Property 8. $f_\vartheta(q, \vartheta)$, $g_q(q)$, $g_\vartheta(q, \vartheta)$ and $k_\vartheta(q, \vartheta)$ are non-linear smooth functions.
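As an illustration of Definition 1 and Property 1, the following minimal sketch builds the kinematic part of the strict-feedback model for a differential-drive WMR with $l = 0$. The no-lateral-slip constraint row `A(theta)`, the function names and the diagonal full inertia matrix $M = \mathrm{diag}(m, m, I_1)$ are assumptions introduced here for illustration; the values $m = 10$ kg and $I_1 = 5$ kg·m² are those quoted in the simulation section.

```python
import math

def S(theta):
    """Null-space basis S(q) for l = 0; columns map (v, omega) to (x_dot, y_dot, theta_dot)."""
    return [[math.cos(theta), 0.0],
            [math.sin(theta), 0.0],
            [0.0, 1.0]]

def A(theta):
    """Assumed no-lateral-slip constraint row: -x_dot*sin(theta) + y_dot*cos(theta) = 0."""
    return [-math.sin(theta), math.cos(theta), 0.0]

def constraint_residual(theta):
    """A(q) S(q); it should vanish because S spans the null space of A."""
    s = S(theta)
    a = A(theta)
    return [sum(a[k] * s[k][j] for k in range(3)) for j in range(2)]

def reduced_inertia(theta, m=10.0, inertia=5.0):
    """M_bar = S^T M S with the assumed M = diag(m, m, inertia)."""
    s = S(theta)
    diag = [m, m, inertia]
    return [[sum(s[k][i] * diag[k] * s[k][j] for k in range(3))
             for j in range(2)] for i in range(2)]

def kinematics(q, velocity):
    """First line of the strict-feedback model: q_dot = S(q) * velocity."""
    s = S(q[2])
    return [sum(s[i][j] * velocity[j] for j in range(2)) for i in range(3)]
```

Under these assumptions the reduced inertia is the constant matrix $\mathrm{diag}(m, I_1)$ for every heading, so it is symmetric, positive definite and bounded, which is consistent with Property 1.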
Definition 2. If a reference robot generates a bounded smooth trajectory vector $q_d$ that satisfies the constraint $\dot q_d = S(q_d)\vartheta_{rd}$, where $\vartheta_{rd}$ is the smooth reference velocity vector, the main objective of the robust adaptive tracking control problem for the WMR is to design integrated kinematic and dynamic feedback control laws for the dynamic system (8), which contains uncertainty terms and disturbance, such that $e_q \to 0$ as $t \to \infty$, with $e_q = q - q_d$. Furthermore, a defined tracking cost function related to (8) must be optimized.
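To obtain tracking dynamics for designing the integrated kinematic and dynamic feedback control law, model (8) is transformed in several steps. The first equation in (8) is written as
$$\dot q - \dot q_d = \dot e_q = -\dot q_d + g_q(q)\vartheta_d + g_q(q)(\vartheta - \vartheta_d) = g_q(q)\vartheta_d^* + g_q(q)e_\vartheta \qquad (9)$$
where $e_\vartheta = \vartheta - \vartheta_d \in \mathbb{R}^{n-m}$ and $\vartheta_d \in \mathbb{R}^{n-m}$ is the virtual control input such that $\vartheta_d = \vartheta_d^* + \vartheta_{da}$, where $\vartheta_d^* \in \mathbb{R}^{n-m}$ is an optimal tracking control input vector designed later and $\vartheta_{da}$, the feed-forward virtual control input, is the solution of the equation
$$0 = -\dot q_d + g_q(q)\vartheta_{da} \qquad (10)$$
Similarly, the last equation in (8) is written as
$$\dot\vartheta - \dot\vartheta_d = \dot e_\vartheta = -\dot\vartheta_d + f_\vartheta(q, \vartheta) + g_\vartheta(q, \vartheta)\tau + k_\vartheta(q, \vartheta)\tau_d = f_\vartheta(q, \vartheta) + g_\vartheta(q, \vartheta)\tau^* - g_q^T(q)e_q + k_\vartheta(q, \vartheta)\tau_d \qquad (11)$$
where $\tau^*$ is the tracking control input designed later such that $\tau = \tau^* + \tau_a$, and $\tau_a$ is a solution of the equation
$$0 = -\dot\vartheta_d + g_\vartheta(q, \vartheta)\tau_a + g_q^T(q)e_q \qquad (12)$$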
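Definition 3. Let $x_d = [q_d^T, \vartheta_d^T]^T \in \mathbb{R}^{(2n-m) \times 1}$, $x = [q^T, \vartheta^T]^T \in \mathbb{R}^{(2n-m) \times 1}$, $e = [e_q^T, e_\vartheta^T]^T \in \mathbb{R}^{2n-m}$, $f(x) = [0_{n \times 1}^T, f_\vartheta^T(q, \vartheta)]^T \in \mathbb{R}^{2n-m}$, $u^* = u - u_a$, $u = [\vartheta_d^T, \tau^T]^T \in \mathbb{R}^{2(n-m) \times 1}$, $u_a = [\vartheta_{da}^T, \tau_a^T]^T \in \mathbb{R}^{2(n-m) \times 1}$, $g(x) = \mathrm{diag}[g_q(q), g_\vartheta(x)] \in \mathbb{R}^{(2n-m) \times 2(n-m)}$, $k(x) = \mathrm{diag}[k_q(q), k_\vartheta(q, \vartheta)] \in \mathbb{R}^{(2n-m) \times 2(n-m)}$, $k_q(q) = 0_{n \times (n-m)}$ and $d = [0_{1 \times (n-m)}, \tau_d^T]^T \in \mathbb{R}^{2(n-m) \times 1}$.
Lemma 1. Consider the tracking dynamics of the WMR as follows:
$$\dot e = f(x) + g(x)u^* + k(x)d \qquad (13)$$
If the control law $u^*$ is designed for (13), it can also be the control law for (8); that is, the control law $u^*$ is equivalent for (13) and (8).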
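Proof. For (8), choosing the Lyapunov function candidate $J = e^T e/2$ and taking the derivative along (9) and (11), one obtains
$$\begin{aligned} \dot J &= e_q^T f_q + e_q^T g_q\vartheta_d^* + e_q^T g_q e_\vartheta + e_\vartheta^T f_\vartheta + e_\vartheta^T g_\vartheta\tau^* - e_\vartheta^T g_q^T e_q + e_\vartheta^T k_\vartheta\tau_d \\ &= e_q^T(f_q + g_q\vartheta_d^*) + e_\vartheta^T(f_\vartheta + g_\vartheta\tau^* + k_\vartheta\tau_d) \\ &= e^T\left(f(x) + g(x)u^* + k(x)d\right) \end{aligned} \qquad (14)$$
Comparing (14) and (13), it can be seen that the control law $u^*$ is equivalent for (13) and (8). This completes the proof.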
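The step from the first to the second line of (14) relies on the two coupling terms cancelling, since $e_q^T g_q e_\vartheta$ and $e_\vartheta^T g_q^T e_q$ are the same scalar. The following minimal sketch checks this numerically for a differential-drive $g_q(q) = S(q)$ with $l = 0$; the function names and the random test points are assumptions made for illustration.

```python
import math
import random

def g_q(theta):
    """Kinematic input matrix g_q(q) = S(q) of a differential-drive robot (l = 0)."""
    return [[math.cos(theta), 0.0],
            [math.sin(theta), 0.0],
            [0.0, 1.0]]

def cross_terms(theta, e_q, e_v):
    """Return (e_q^T g_q e_v, e_v^T g_q^T e_q), the two coupling terms in (14)."""
    g = g_q(theta)
    first = sum(e_q[i] * g[i][j] * e_v[j] for i in range(3) for j in range(2))
    second = sum(e_v[j] * g[i][j] * e_q[i] for i in range(3) for j in range(2))
    return first, second

random.seed(0)
residuals = []
for _ in range(5):
    theta = random.uniform(-math.pi, math.pi)
    e_q = [random.uniform(-1, 1) for _ in range(3)]   # posture error sample
    e_v = [random.uniform(-1, 1) for _ in range(2)]   # velocity error sample
    a, b = cross_terms(theta, e_q, e_v)
    residuals.append(a - b)  # difference of the two coupling terms
```

Every entry of `residuals` is zero up to rounding, which is why only the drift, input and disturbance terms remain in the last line of (14).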
Remark 1. If the control law $u^*$ exists, it is an integrated kinematic and dynamic control law, as opposed to back-stepping control laws in which the kinematic and dynamic control inputs are separated.
Remark 2. $f(x)$ represents the system's drift tracking dynamics and $d$ is the bounded unknown disturbance according to Property 3.
Fact 1.k k q qq k dk = ek k, thus by Property 4, f (x) is con-strained by f (x) m1mincmaxk k.e
RL-based intelligent tracking control algorithm
According to the defined objective, applying and developing the policy iteration (PI) algorithm of RL for system (13) is an appropriate choice. RL can be used to learn online HJB solutions for optimal control problems (Vamvoudakis and Lewis, 2010) and HJI solutions for $H_\infty$ optimal problems (Vamvoudakis and Lewis, 2011; Vamvoudakis et al., 2011). Define a value function based on the $H_\infty$ tracking performance index function subject to (13) (Chen et al., 1998; Chen et al., 2009; Luy et al., 2010):
$$V = \int_t^{\infty} r(e(\tau), u(\tau), d(\tau))\,d\tau \qquad (15)$$
where $r(e, u, d) = Q(e) + u^T R u - \rho^2 d^T d$; $Q(e)$ is positive definite, i.e. $Q(e) > 0$ for all $e \ne 0$ and $Q(e) = 0$ for $e = 0$; $u$ is the admissible control input that minimizes $V$ while $d$ tries to maximize $V$ (Dierks and Jagannathan, 2010; Vamvoudakis and Lewis, 2011); $R$ is a symmetric positive definite matrix; and $\rho \ge \rho^*$ is the prescribed disturbance attenuation level, where $\rho^* > 0$ is the minimum gain of $\rho$ for which the stability of the closed-loop tracking system (13) is guaranteed (Van Der Shaft, 1992). Define the Hamiltonian of (15) associated with $u$ and $d$ as
$$H(e, u, d, V_e) = r(e, u, d) + V_e^T\left(f(x) + g(x)u + k(x)d\right) \qquad (16)$$
where $V_e = \partial V(e)/\partial e$. There exists a minimum non-negative local smooth solution of (16) (Dierks and Jagannathan, 2010; Vamvoudakis et al., 2011). If $V_e^*$ is that solution and (13) is locally detectable, then the Nash equilibrium solutions $u^*$ and $d^*$ in terms of $V_e^*$ can be found from the stationary condition of (16):
$$u^*(e) = -\frac{1}{2}R^{-1}g^T(x)V_e^* \qquad (17)$$
$$d^*(e) = \frac{1}{2\rho^2}k^T(x)V_e^* \qquad (18)$$
where $V_e^* = \partial V^*(e)/\partial e$. The tracking HJI equation is obtained by substituting (17) and (18) into (16):
$$Q(e) + V_e^{*T}f(x) - \frac{1}{4}V_e^{*T}g(x)R^{-1}g^T(x)V_e^* + \frac{1}{4\rho^2}V_e^{*T}k(x)k^T(x)V_e^* = 0, \quad V^*(0) = 0 \qquad (19)$$
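For readers who want the intermediate step, the stationarity conditions behind (17) and (18) and their substitution into the Hamiltonian can be sketched as follows. This is a worked derivation under the assumption that $R$ is symmetric, using only the symbols already defined in (15) and (16).

```latex
\begin{align*}
\frac{\partial H}{\partial u} &= 2Ru + g^{T}(x)V_e = 0
  &&\Rightarrow\quad u^{*} = -\tfrac{1}{2}R^{-1}g^{T}(x)V_e^{*},\\
\frac{\partial H}{\partial d} &= -2\rho^{2}d + k^{T}(x)V_e = 0
  &&\Rightarrow\quad d^{*} = \tfrac{1}{2\rho^{2}}k^{T}(x)V_e^{*},\\
u^{*T}Ru^{*} + V_e^{*T}g(x)u^{*}
  &= \tfrac{1}{4}V_e^{*T}gR^{-1}g^{T}V_e^{*} - \tfrac{1}{2}V_e^{*T}gR^{-1}g^{T}V_e^{*}
  &&= -\tfrac{1}{4}V_e^{*T}gR^{-1}g^{T}V_e^{*},\\
-\rho^{2}d^{*T}d^{*} + V_e^{*T}k(x)d^{*}
  &= -\tfrac{1}{4\rho^{2}}V_e^{*T}kk^{T}V_e^{*} + \tfrac{1}{2\rho^{2}}V_e^{*T}kk^{T}V_e^{*}
  &&= \tfrac{1}{4\rho^{2}}V_e^{*T}kk^{T}V_e^{*}.
\end{align*}
```

Adding $Q(e) + V_e^{*T}f(x)$ to the last two results and setting the Hamiltonian to zero reproduces the four terms of the tracking HJI equation.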
Solutions of the HJI equation (19) can be learned without explicit knowledge of the system's drift dynamics by an integral RL-based PI algorithm, in which three NNs of the same structure are required for the actor critic (Vamvoudakis et al., 2011). Using three NNs may lead to high computational complexity and resource consumption when applied to multivariable systems such as the WMR system defined earlier. Therefore, in this paper, a new actor critic scheme is proposed for the tracking problem using only one NN, with the purpose of reducing the cost of computation and storage resources. The critic NN that approximates the optimal value function (15) is defined as
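$$V^*(e) = W^T\Phi(e) + \varepsilon(e) \qquad (20)$$
where $\Phi(e): \mathbb{R}^n \to \mathbb{R}^N$ is the activation function vector, $N$ is the number of neurons in the hidden layer, $\varepsilon(e)$ is the NN approximation error and $W \in \mathbb{R}^N$ is the ideal NN weight vector. $\Phi(e)$ can be selected such that, as $N \to \infty$, $\varepsilon(e) \to 0$ and $\varepsilon_e(e) = \partial\varepsilon(e)/\partial e \to 0$, and for fixed $N$, $\|\varepsilon(e)\| < \varepsilon_{\max}$ and $\|\varepsilon_e(e)\| < \varepsilon_{e\max}$, where $\varepsilon_{\max}$ and $\varepsilon_{e\max}$ are positive constants (Finlayson, 1990). Substituting (20) into the HJI equation (19) gives the NN-based HJI equation
$$Q(e) + W^T\Phi_e f(x) - \frac{1}{4}W^T\Phi_e G\Phi_e^T W + \frac{1}{4}W^T\Phi_e K\Phi_e^T W + \varepsilon_{HJI} = 0 \qquad (21)$$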
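where $G = gR^{-1}g^T$, $K = kk^T/\rho^2$, $\Phi_e(e) = \partial\Phi(e)/\partial e$ and $\varepsilon_{HJI}$ is the residual error formed by the NN approximation error:
$$\begin{aligned} \varepsilon_{HJI} &= \varepsilon_e^T f - \frac{1}{2}W^T\Phi_e(G - K)\varepsilon_e - \frac{1}{4}\varepsilon_e^T(G - K)\varepsilon_e \\ &= \varepsilon_e^T\left(f(x) + gu^* + kd^*\right) + \frac{1}{4}\varepsilon_e^T(G - K)\varepsilon_e \end{aligned} \qquad (22)$$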
When $N \to \infty$, $\varepsilon_{HJI}$ converges uniformly to zero. For fixed $N$, $\varepsilon_{HJI}$ is bounded on a compact set (Vamvoudakis et al., 2011).
Fact 2. According to Properties 5 and 6, $G$ is bounded such that $0 < G_{\min} \le \|G\| \le G_{\max}$, with $G_{\min} = g_{\min}^2/\sigma_{\max}(R)$ and $G_{\max} = g_{\max}^2/\sigma_{\min}(R)$, where $\sigma_{\max}(R)$ and $\sigma_{\min}(R)$ are the largest and smallest eigenvalues of $R$, respectively.
Fact 3. According to Property 7, $K$ is bounded such that $0 < K_{\min} \le \|K\| \le K_{\max}$, with $K_{\min} = k_{\min}^2/\rho^2$ and $K_{\max} = k_{\max}^2/\rho^2$.
Assumption 1. The closed-loop tracking dynamics of the WMR are bounded such that $\|f(x) + gu^* + kd^*\| \le \gamma_{\max}$ for a positive constant $\gamma_{\max}$.
The ideal weight vector $W$ in (20) is unknown, thus $V(e)$ is estimated by
$$\hat V(e) = \hat W^T\Phi(e) \qquad (23)$$
Then, the estimated control and disturbance laws become
$$\hat u(e) = -\frac{1}{2}R^{-1}g^T(x)\Phi_e^T\hat W \qquad (24)$$
$$\hat d(e) = \frac{1}{2\rho^2}k^T(x)\Phi_e^T\hat W \qquad (25)$$
The approximate Hamiltonian is obtained by substituting (23), (24) and (25) into (16):
$$\hat H(e, \hat W) = Q(e) + \hat W^T\Phi_e f(x) - \hat W^T\Phi_e G\Phi_e^T\hat W/4 + \hat W^T\Phi_e K\Phi_e^T\hat W/4 \qquad (26)$$
Observing Equations (21) and (26), it is straightforward to see that $\hat W$ should be tuned to minimize an error function related to $\hat H(e, \hat W)$. To design a tuning law for $\hat W$ that does not depend on $f(x)$, the error function is chosen as $E = \frac{1}{2}e_H^T e_H$, where $e_H = \int_t^{t+T}\hat H(e, \hat W)\,dt$ and $T > 0$ is a chosen sampling time. Then, the tuning law becomes $\dot{\hat W} = -\alpha_1\,\partial E/\partial\hat W$. In addition, due to the approximation error during online learning, it is desired to design the tuning law of $\hat W$ such that it not only minimizes $E$ but also guarantees the stability of the system concurrently. If more than one NN is used, the tuning law of the critic NN is responsible for minimizing $E$, while the tuning laws of the actor NNs guarantee the robust stability of the overall system. In our case, only one NN is used and thus both objectives must be integrated into one, i.e.
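$$\dot{\hat W} = \begin{cases} \dot{\hat W}_1, & e_{t+T}^T e_{t+T} - e_t^T e_t < 0 \\ W_{RB} + \dot{\hat W}_1, & \text{otherwise} \end{cases} \qquad (27)$$
where $e_t = e(t)$, $e_{t+T} = e(t + T)$, and
$$\dot{\hat W}_1 = -\alpha_1\frac{\sigma}{(\sigma^T\sigma + 1)^2}\left(\int_t^{t+T}\left(Q(e) + \frac{1}{4}\hat W^T\Phi_e G\Phi_e^T\hat W - \frac{1}{4}\hat W^T\Phi_e K\Phi_e^T\hat W\right)dt + \Delta\Phi^T(e(t))\hat W\right) \qquad (28)$$
$$W_{RB} = \frac{\alpha_2}{2}\Phi_e(G - K)e \qquad (29)$$
where
$$\sigma = \int_t^{t+T}\Phi_e\left(f(x) + g\hat u + k\hat d\right)dt = \int_t^{t+T}\Phi_e\dot e\,dt = \int_t^{t+T}d\left(\Phi(e(t))\right) = \Phi(e_{t+T}) - \Phi(e_t) = \Delta\Phi(e(t)).$$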
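To make the structure of the gradient part of the tuning law concrete, the sketch below applies one explicit-Euler step of the normalised update, with $\sigma = \Phi(e_{t+T}) - \Phi(e_t)$. It is an illustration under stated assumptions rather than the authors' implementation: the integral of $Q(e) + \frac{1}{4}\hat W^T\Phi_e(G - K)\Phi_e^T\hat W$ over $[t, t+T]$ is assumed to be precomputed and passed in as `rho_int`, the robustifying term $W_{RB}$ is omitted, and the function names and the step size `dt` are hypothetical.

```python
def integral_residual(rho_int, sigma, w_hat):
    """e_H = rho_int + sigma^T w_hat.

    rho_int: assumed precomputed integral over [t, t+T] of
             Q(e) + (1/4) w^T Phi_e (G - K) Phi_e^T w.
    sigma:   Phi(e_{t+T}) - Phi(e_t).
    """
    return rho_int + sum(s * w for s, w in zip(sigma, w_hat))

def critic_step(w_hat, sigma, rho_int, alpha1, dt):
    """One explicit-Euler step of w_dot = -alpha1 * sigma / (sigma^T sigma + 1)^2 * e_H."""
    e_h = integral_residual(rho_int, sigma, w_hat)
    norm = (sum(s * s for s in sigma) + 1.0) ** 2
    return [w - dt * alpha1 * s / norm * e_h for w, s in zip(w_hat, sigma)]
```

With `rho_int` held fixed, a sufficiently small step shrinks the residual by the factor $1 - \mathrm{dt}\,\alpha_1\,\sigma^T\sigma/(\sigma^T\sigma + 1)^2$, which illustrates why the normalisation keeps the update well scaled even when $\sigma$ is large.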
It will be shown in the proof of Theorem 1 that $\dot{\hat W}_1$ in (27), together with the added term $W_{RB}$, guarantees that the closed-loop system is uniformly ultimately bounded (UUB; Lewis et al., 1999) even when the behaviour of the overall system becomes unstable.
The proposed actor critic structure for online learning and feedback control is shown in Figure 1. The adaptive tuning law (27) for the single NN updates the NN weights such that the error function of the approximate Hamiltonian (26) is minimized, and it does not involve the system's drift dynamics, so the intelligent tracking control law defined previously can be obtained.
To guarantee the convergence of $\hat W$, the control inputs and disturbance must be fully explored by adding probing noise to $\hat u(e)$ and $\hat d(e)$. That means the persistence of excitation (PE) condition on the interval $[t, t + T_P]$ with $T_P > 0$ must be satisfied for all $t$ (Vamvoudakis et al., 2011):
$$\beta_1 I \le \int_t^{t+T_P}\bar\sigma^T(t)\bar\sigma(t)\,dt \le \beta_2 I \qquad (30)$$
where $\beta_1$ and $\beta_2$ are positive constants, $\bar\sigma = \sigma/(\sigma^T\sigma + 1)$ and $I$ is the identity matrix of appropriate dimension.
Theorem 1. Let the tracking dynamics of the WMR be given by (13) with the tracking HJI equation (19), the critic NN be given by (20), the tuning law for the critic be defined in (27), the intelligent tracking control and disturbance laws that approximate the $H_\infty$ optimal tracking cost function (15) be defined in (24) and (25), and $\bar\sigma$ satisfy the PE condition (30). Then, the closed-loop system state $e$ and the NN weight error $\tilde W$ are UUB with a limited number of hidden layer units. Furthermore, the approximation errors of the control input and worst-case disturbance are bounded such that $\|u^* - \hat u\| < \varepsilon_u$ and $\|d^* - \hat d\| < \varepsilon_d$ for small positive constants $\varepsilon_u$ and $\varepsilon_d$.
Proof. See Appendix A.
The proposed algorithm is represented by the block diagram in Figure 2. $T_{stop}$ is the time at which the algorithm stops, $p$ is the probing noise and the other parameters are as defined before.
Simulation results
To verify the proposed algorithm, two numerical simulations are presented. In the first, a non-linear system is learned and controlled by the proposed algorithm using one NN, in comparison with an algorithm using three NNs (Vamvoudakis et al., 2011). In the second, the proposed algorithm is applied to the WMR.
[Figure 1: block diagram containing the WMR tracking dynamics (Eq. 13), the critic NN (Eq. 23), the adaptive tuning law with parameters $G$, $K$, $\alpha_1$, $\alpha_2$, $Q$ and $R$, and the actor blocks for the control law (Eq. 24) and the disturbance law (Eq. 25).]
Figure 1. The proposed actor critic structure.
Non-linear system
Consider the non-linear system with disturbance inputs, with a quadratic cost defined as in Vamvoudakis et al. (2011):
$$\dot x = f(x) + g(x)u + k(x)d \qquad (31)$$
where
$$f(x) = \begin{bmatrix} -x_1 + x_2 \\ -x_1^3 - x_2^3 + 0.25x_2\left(\cos(2x_1) + 2\right)^2 - \dfrac{0.25}{\rho^2}x_2\left(\sin(4x_1) + 2\right)^2 \end{bmatrix},$$
$g(x) = [0, \cos(2x_1) + 2]^T$ and $k(x) = [0, \sin(4x_1) + 2]^T$. We simulate the non-linear system in turn with the proposed algorithm using one NN and with the algorithm in Vamvoudakis et al. (2011) using three NNs. To be comparable, the optimal tracking problem in this paper is transformed into the optimal control problem of Vamvoudakis et al. (2011) by defining the vector of tracking error as $e = x - x_d$, where $x_d = 0$. In this case, the tracking dynamic Equation (13) takes the form of Equation (31).
In both algorithms, one selects $Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $R = 1$, $\rho = 8$, $\alpha_1 = 50$, $\alpha_2 = 0.01$, $T = 0.05$ and $T_{stop} = 80$ s. The optimal value function is $V^*(x) = \frac{1}{4}x_1^4 + \frac{1}{2}x_2^2$, so by theory the optimal inputs are $u^*(x) = -\frac{1}{2}\left(\cos(2x_1) + 2\right)x_2$ and $d^*(x) = \frac{1}{2\rho^2}\left(\sin(4x_1) + 2\right)x_2$. The NN activation function vector is defined as $\Phi(x) = [x_1^2, x_2^2, x_1^4, x_2^4]^T$, and the weight vectors of the critic NNs are defined as $\hat W^1 = [\hat W_1^1, \hat W_2^1, \hat W_3^1, \hat W_4^1]^T$ for the algorithm using one NN and $\hat W^3 = [\hat W_1^3, \hat W_2^3, \hat W_3^3, \hat W_4^3]^T$ for the one using three NNs. All initial values of the weights are zero. The other parameters for the three NNs can be found in Vamvoudakis et al. (2011).
The convergence of the critic parameters of both algorithms is shown in Figure 3. In the algorithm using one NN, all parameters converge at about 20 s to the values $\hat W^1 = [0.006, 0.5, 0.2483, 0]^T$, while using three NNs they converge more slowly, at about 50 s, to the values $\hat W^3 = [0.005, 0.5, 0.2437, 0]^T$. In addition, the parameters of the NNs for the actor and disturbance in the algorithm using three NNs also converge to approximately optimal values (see Vamvoudakis et al., 2011, for more detail). Thus, using (24) and (25), both algorithms give similar optimal control inputs $u^*(x)$ and optimal disturbance inputs $d^*(x)$. However, the proposed algorithm using a single NN reduces the complexity and resources required, and converges faster than the algorithm using three NNs.
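The relation between the critic weights and the two policies in this example can be checked with a short script. The sketch below evaluates (24) and (25) for the basis $\Phi(x) = [x_1^2, x_2^2, x_1^4, x_2^4]^T$ and compares them with the closed-form optimal inputs quoted above. The weight vector `W_IDEAL = [0, 0.5, 0.25, 0]` is the one that represents $V^*(x) = \frac{1}{4}x_1^4 + \frac{1}{2}x_2^2$ exactly in this basis; it is not one of the converged values reported in the simulation, and all function names are assumptions made for illustration.

```python
import math

RHO = 8.0   # disturbance attenuation level used in the example
R = 1.0     # control weighting used in the example

def g(x):
    """Input vector g(x) = [0, cos(2 x1) + 2]^T."""
    return [0.0, math.cos(2.0 * x[0]) + 2.0]

def k(x):
    """Disturbance vector k(x) = [0, sin(4 x1) + 2]^T."""
    return [0.0, math.sin(4.0 * x[0]) + 2.0]

def phi_grad(x):
    """Jacobian of Phi(x) = [x1^2, x2^2, x1^4, x2^4]^T, one row per basis function."""
    return [[2.0 * x[0], 0.0],
            [0.0, 2.0 * x[1]],
            [4.0 * x[0] ** 3, 0.0],
            [0.0, 4.0 * x[1] ** 3]]

def value_gradient(x, w):
    """V_x = Phi_x^T w."""
    rows = phi_grad(x)
    return [sum(w[i] * rows[i][j] for i in range(4)) for j in range(2)]

def control(x, w):
    """Equation (24): u = -(1/2) R^{-1} g^T Phi_x^T w."""
    vx = value_gradient(x, w)
    return -0.5 / R * sum(gi * vi for gi, vi in zip(g(x), vx))

def disturbance(x, w):
    """Equation (25): d = 1/(2 rho^2) k^T Phi_x^T w."""
    vx = value_gradient(x, w)
    return 0.5 / RHO ** 2 * sum(ki * vi for ki, vi in zip(k(x), vx))

# Weights that reproduce V*(x) = x1^4/4 + x2^2/2 exactly in this basis.
W_IDEAL = [0.0, 0.5, 0.25, 0.0]
```

Evaluating `control(x, W_IDEAL)` and `disturbance(x, W_IDEAL)` at any state returns $-\frac{1}{2}(\cos(2x_1) + 2)x_2$ and $\frac{1}{2\rho^2}(\sin(4x_1) + 2)x_2$, so the closer the learned weights are to `W_IDEAL`, the closer the learned policies are to the theoretical ones.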
Wheeled mobile robot
Consider the WMR defined above. With the notation introduced before, the state vectors and parameters of the WMR are $q = [x, y, \theta]^T$, $\vartheta = [v, \omega]^T$, $r_1 = 0.05$ m, $b_1 = 0.5$ m, $l = 0$, $m = 10$ kg and $I_1 = 5$ kg·m², where $m$ and $I_1$ denote the mass and the moment of inertia of the platform, motors and wheels, respectively. Note that with the designed robust adaptive control law, the WMR parameters can change online within bounded domains. One assumes that the control torques $\tau$ applied to the DC motor-mounted gearboxes are statically related to the voltage input by a constant, so the electrical dynamics of the motors can be included in the general disturbance $\tau_d$ such that $\|\tau_d\| \le 3$ N·m. The WMR matrices and the desired velocities of the reference robot are defined as
$$C(q, \dot q)\dot q = ml^2\dot\theta^2[\cos\theta, \sin\theta, 0]^T,$$
$$S(q) = \begin{bmatrix} \cos\theta & 0 \\ \sin\theta & 0 \\ 0 & 1 \end{bmatrix}, \quad \bar M = \begin{bmatrix} m & 0 \\ 0 & I_1 \end{bmatrix}, \quad \bar B = \begin{bmatrix} \dfrac{1}{r_1} & \dfrac{b_1}{r_1} \\ \dfrac{1}{r_1} & -\dfrac{b_1}{r_1} \end{bmatrix},$$
$$\vartheta_{rd} = \begin{bmatrix} \sqrt{\cos^2 t + 4\cos^2(2t)} \\ \dfrac{2\sin t\cos(2t) - 4\sin(2t)\cos t}{\cos^2 t + 4\cos^2(2t)} \end{bmatrix}.$$
Then $f(x)$, $g(x)$ and $k(x)$ are identified by substituting these parameters into the formulations in Definitions 1 and 3. The smooth desired eight-shaped trajectory $q_d = [x_d, y_d, \theta_d]^T$ is generated by $\vartheta_{rd}$ and satisfies the constraint in Definition 2. The weight vector of the critic NN is defined as $\hat W = [\hat w_1, \hat w_2, \ldots, \hat w_{15}]^T$, whose initial values are zero. The adaptive gains are selected as $\alpha_1 = 100$ and $\alpha_2 = 0.01$. The activation function vector of the critic NN, with 15 elements, is chosen as $\Phi(e) = [e_x^2, e_xe_y, e_xe_\theta, e_xe_v, e_xe_\omega, e_y^2, e_ye_\theta, e_ye_v, e_ye_\omega, e_\theta^2, e_\theta e_v, e_\theta e_\omega, e_v^2, e_ve_\omega, e_\omega^2]^T$. One selects $R = I \in \mathbb{R}^{4 \times 4}$, $Q = I \in \mathbb{R}^{5 \times 5}$ and $\rho = 5$. The PE condition is enforced by adding the noise $e^{-0.005t}\,\mathrm{rand}(t)$ to the control inputs and disturbance, where $\mathrm{rand}(t)$ generates random signals in the range $[-1, 1]$. The desired posture of the virtual robot is initialized at $q_d(0) = [x_d, y_d, \theta_d]^T = [0, 0, \pi/6]^T$. The initial position and velocities of the WMR are $q(0) = [0.5, 0.5, 0]^T$ m and $\vartheta(0) = [0, 0]^T$, respectively. The parameter $T$ is chosen as 0.01 s and $T_{stop} = 800$ s.

[Figure 2: flowchart with the steps: start; assign $T_{stop}$, $p$, $T$, $Q$, $R$, $\rho$, $\alpha_1$, $\alpha_2$; initialize $x_0$, $e_0$, $\hat W_0 = 0$, $t = 0$; while $t < T_{stop}$, evaluate $\Phi(e(t))$ and $\hat V = \hat W^T(t)\Phi(e(t))$, apply $\hat u(e)$ and $\hat d(e)$ with the probing noise $p$ to the WMR (Eq. 13), set $t = t + T$, compute $e(t + T) = x(t + T) - x_d(t + T)$ and update $\hat W$ by Eq. (27); stop.]
Figure 2. The block diagram of the algorithm.

[Figure 3: plot of the critic weights $W_1^1, \ldots, W_4^1$ and $W_1^3, \ldots, W_4^3$ versus time (s).]
Figure 3. Convergence of parameters of the critic neural networks (NNs) in algorithms using one and three NNs.
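The reference velocities above can be integrated through the kinematic constraint of Definition 2 to recover the eight-shaped path. The sketch below does this with a fixed-step fourth-order Runge–Kutta scheme; the integrator, the step size and the function names are assumptions made for illustration. When the initial heading is set to $\mathrm{atan2}(2, 1)$ (an assumption that differs from the $\pi/6$ used in the paper), the integrated path coincides with $x_d = \sin t$, $y_d = \sin 2t$; any other initial heading gives the same figure rotated about the starting point.

```python
import math

def reference_velocity(t):
    """Reference velocities (v_d, omega_d) as defined in the simulation section."""
    c = math.cos(t) ** 2 + 4.0 * math.cos(2.0 * t) ** 2
    v = math.sqrt(c)
    omega = (2.0 * math.sin(t) * math.cos(2.0 * t)
             - 4.0 * math.sin(2.0 * t) * math.cos(t)) / c
    return v, omega

def posture_rate(t, q):
    """q_dot = S(q) * [v_d, omega_d]^T for q = [x, y, theta]."""
    v, omega = reference_velocity(t)
    return [v * math.cos(q[2]), v * math.sin(q[2]), omega]

def integrate_reference(q0, t_end, dt=0.001):
    """Fixed-step RK4 integration of the reference posture from t = 0 to t_end."""
    q, t = list(q0), 0.0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        k1 = posture_rate(t, q)
        k2 = posture_rate(t + dt / 2, [q[i] + dt / 2 * k1[i] for i in range(3)])
        k3 = posture_rate(t + dt / 2, [q[i] + dt / 2 * k2[i] for i in range(3)])
        k4 = posture_rate(t + dt, [q[i] + dt * k3[i] for i in range(3)])
        q = [q[i] + dt / 6 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i]) for i in range(3)]
        t += dt
    return q
```

For example, `integrate_reference([0.0, 0.0, math.atan2(2.0, 1.0)], 1.0)` returns a posture whose position agrees with $(\sin 1, \sin 2)$ to within the integration error.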
The convergence of the critic parameters is shown in Figure 4. It can be seen that almost all parameters converge after 300 s. The PE noise can be removed at any time after that; here it is removed after 500 s. The evolution of the posture tracking errors during the simulation is presented in Figure 5. Although affected by input disturbances, the errors still converge close to zero. Posture tracking of the WMR versus the reference robot by the designed robust direct adaptive controller is shown in Figure 6. The evolution of the tracking errors between the virtual control velocities and the WMR velocities during simulation is shown in Figures 7 and 8, while Figure 9 shows the actual and virtual velocities.
[Figure 4: plot of the critic weights W1 to W15 versus time (s), 0 to 800 s.]
Figure 4. Convergence of parameters of the critic neural network (NN).
[Figure 5: plot of ex, ey and etheta versus time (s), 0 to 800 s.]
Figure 5. Evolution of the posture tracking errors during simulation.
[Figure 6: plot of the WMR and virtual robot paths; horizontal axis x (m).]
Figure 6. Posture of wheeled mobile robot (WMR) with input disturbances.
[Figure 7: plot of the linear velocity error ev versus time (s).]
Figure 7. Evolution of the linear velocity tracking error.
Conclusion
The paper presents a new method for designing an integrated kinematic and dynamic intelligent tracking control algorithm for a WMR. The designed algorithm is a synchronous policy iteration using the actor critic structure with a single NN. The closed-loop dynamic tracking errors and critic parameters are proved to be UUB during online learning. The optimal value function, the robust direct adaptive control input and the worst-case disturbance converge to approximately optimal values.
Funding
This research received no specific grant from any funding
agency in the public, commercial, or not-for-profit sectors
References
Abu-Khalaf M and Lewis FL (2005) Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 41(5): 779–791.
Chen BS, Uang HJ and Tseng CS (1998) Robust tracking enhancement of robot systems including motor dynamics: a fuzzy-based dynamic game approach. IEEE Transactions on Fuzzy Systems 6(4): 538–552.
Chen H, Ma M, Wang H, et al. (2009) Moving horizon $H_\infty$ tracking control of wheeled mobile robots with actuator saturation. IEEE Transactions on Control Systems Technology 17(2): 449–457.
Chwa D (2010) Tracking control of differential-drive wheeled mobile robots using a backstepping-like feedback linearization. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans 40(6): 1285–1295.
Dierks T and Jagannathan S (2010) Optimal control of affine non-linear continuous-time systems using an online Hamilton–Jacobi–Isaacs formulation. In: 49th IEEE Proceedings of the CDC2010, pp. 3048–3053.
Fierro R and Lewis FL (1998) Control of a nonholonomic mobile robot using neural networks. IEEE Transactions on Neural Networks 4: 589–600.
Finlayson BA (1990) The Method of Weighted Residuals and Variational Principles. New York: Academic Press.
Hornik K, Stinchcombe M and White H (1990) Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks 3: 551–560.
Khoshnam S, Alireza MS and Ahmadrez T (2011) Adaptive feedback linearizing control of nonholonomic wheeled mobile robots in presence of parametric and nonparametric uncertainties. Robotics and Computer-Integrated Manufacturing 27(1): 194–204.
Lewis FL, Jagannathan S and Yesildirek A (1999) Neural Network Control of Robot Manipulators and Nonlinear Systems. London: Taylor & Francis.
Lin WS and Yang PC (2008) Adaptive critic motion control design of autonomous wheeled mobile robot by dual heuristic programming. Automatica 44: 2716–2723.
Luy NT (2012) Reinforcement learning-based optimal tracking control for wheeled mobile robot. In: Proceeding of the IEEE International Conference on Cyber Technology in Automation, Control, and Intelligent Systems, pp. 371–376.
Luy NT, Thanh ND, Thanh NT, et al. (2010) Robust reinforcement learning-based tracking control for wheeled mobile robot. In: IEEE Proceedings of the ICCAE2010, Vol. 1, pp. 171–176.
Marvin KB, Simon GF and Liberato C (2009) Dual adaptive dynamic control of mobile robots using neural networks. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 39(1): 129–141.
Miyasato Y (2008) Adaptive $H_\infty$ control of nonholonomic mobile robot based on inverse optimality. In: Proceedings of the American Control Conference, Seattle, WA, pp. 3524–3529.
Mohareri O, Dhaouadi R and Rad AB (2012) Indirect adaptive tracking control of a nonholonomic mobile robot via neural networks. Neurocomputing 88: 54–66.
Vamvoudakis KG and Lewis FL (2010) Online actor critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica 46: 878–888.
Vamvoudakis KG and Lewis FL (2011) Multi-player non-zero-sum games: online adaptive learning solution of coupled Hamilton–Jacobi equations. Automatica 47: 1556–1569.
Vamvoudakis KG, Vrabie D and Lewis FL (2011) Online learning algorithm for zero-sum games with integral reinforcement learning. Journal of Artificial Intelligence and Soft Computing Research 1(4): 315–332.
Van Der Shaft AJ (1992) L2-gain analysis of nonlinear systems and nonlinear state feedback $H_\infty$ control. IEEE Transactions on Automatic Control 37(6): 770–784.
Vrabie D, Pastravanu O, Lewis FL, et al. (2009) Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 45(2): 477–484.
Zenon H and Marcin S (2011) Discrete neural dynamic programming in wheeled mobile robot control. Communications in Nonlinear Science & Numerical Simulation 16: 2355–2362.
Figure 8. Evolution of the rotation velocity tracking error. (Plot of e_ω against time in seconds.)
Figure 9. Actual and virtual velocities with input disturbance. (Plot of ω_d, ω, v_d and v against time in seconds.)
Appendix A: proof
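Proof of Theorem 1

Consider the following Lyapunov function candidate:

\[
L(t) = \frac{1}{2}\int_{t}^{t+T} \alpha_2 e^{T}e \, d\tau + \frac{1}{2}\tilde{W}^{T}\tilde{W} \tag{A.1}
\]

whose time derivative is given by

\[
\dot{L}(t) = \int_{t}^{t+T} \alpha_2 e^{T}\dot{e} \, d\tau + \tilde{W}^{T}\dot{\tilde{W}} \tag{A.2}
\]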
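The step from (A.1) to (A.2) for the windowed integral term can be sanity-checked numerically. The sketch below is only an illustration: the scalar error signal e(τ) = exp(−τ), the weight alpha2 = 1, the window values and all helper names are assumptions made for this check, not quantities from the paper.

```python
import math

def e(tau):
    # Hypothetical scalar tracking error, chosen only for illustration.
    return math.exp(-tau)

def e_dot(tau):
    # Time derivative of the stand-in error signal above.
    return -math.exp(-tau)

def trapz(fn, a, b, n=20000):
    # Composite trapezoidal rule on [a, b] with n sub-intervals.
    h = (b - a) / n
    total = 0.5 * (fn(a) + fn(b))
    for i in range(1, n):
        total += fn(a + i * h)
    return total * h

def window_term(t, T, alpha2=1.0):
    # First term of (A.1): 0.5 * integral over [t, t+T] of alpha2 * e^2.
    return 0.5 * trapz(lambda s: alpha2 * e(s) ** 2, t, t + T)

def window_term_rate(t, T, alpha2=1.0):
    # First term of (A.2): integral over [t, t+T] of alpha2 * e * e_dot.
    return trapz(lambda s: alpha2 * e(s) * e_dot(s), t, t + T)

def finite_difference(t, T, h=1e-5):
    # Central difference of the windowed term with respect to t.
    return (window_term(t + h, T) - window_term(t - h, T)) / (2 * h)

t0, T0 = 0.3, 0.5  # assumed sample time and window length
lhs = finite_difference(t0, T0)
rhs = window_term_rate(t0, T0)
boundary = 0.5 * (e(t0 + T0) ** 2 - e(t0) ** 2)
print(lhs, rhs, boundary)
```

The three printed values should agree to within the discretisation error, and the last one is the boundary form that reappears on the left-hand side of (A.3).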
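Now consider the first part of the tuning law (27), without $W_{RB}$. The negativity condition in (27) can be rewritten as

\[
\frac{1}{2}\left(e_{t+T}^{T}e_{t+T} - e_{t}^{T}e_{t}\right)
= \int_{t}^{t+T} e^{T}\dot{e} \, d\tau
= \int_{t}^{t+T} e^{T}\left(f(x) + g\hat{u} + k\hat{d}\right) d\tau < 0 \tag{A.3}
\]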
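Note from (21) that

\[
Q(e) = -W^{T}\Phi_e f(x) + \frac{1}{4}W^{T}\Phi_e G \Phi_e^{T}W - \frac{1}{4}W^{T}\Phi_e K \Phi_e^{T}W - \varepsilon_{HJI} \tag{A.4}
\]

Substituting $Q(e)$ into (26), one obtains

\[
\hat{H}(e, \tilde{W}) = -\tilde{W}^{T}\Phi_e f(x) + \frac{1}{2}\tilde{W}^{T}\Phi_e (G-K) \Phi_e^{T}W - \frac{1}{4}\tilde{W}^{T}\Phi_e (G-K) \Phi_e^{T}\tilde{W} - \varepsilon_{HJI} \tag{A.5}
\]

where $\tilde{W} = W - \hat{W}$ is the NN weight estimation error.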
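The expansion leading to (A.5) can be spot-checked in the scalar case. The sketch below assumes that (26) has the form Ĥ = Q(e) + Ŵ^T Φ_e f(x) − ¼ Ŵ^T Φ_e (G − K) Φ_e^T Ŵ, which is not reproduced in this appendix, and it treats every quantity as a scalar. The function names and sample values are illustrative only.

```python
from fractions import Fraction as F
from itertools import product

def q_from_a4(W, phi_e, f, G, K, eps_hji):
    # Scalar version of (A.4).
    return (-W * phi_e * f
            + F(1, 4) * W * phi_e * G * phi_e * W
            - F(1, 4) * W * phi_e * K * phi_e * W
            - eps_hji)

def h_hat_direct(W, W_tilde, phi_e, f, G, K, eps_hji):
    # ASSUMED form of (26), evaluated with W_hat = W - W_tilde.
    W_hat = W - W_tilde
    Q = q_from_a4(W, phi_e, f, G, K, eps_hji)
    return Q + W_hat * phi_e * f - F(1, 4) * W_hat * phi_e * (G - K) * phi_e * W_hat

def h_hat_a5(W, W_tilde, phi_e, f, G, K, eps_hji):
    # Scalar version of the right-hand side of (A.5).
    M = phi_e * (G - K) * phi_e
    return -W_tilde * phi_e * f + W_tilde * M * W / 2 - W_tilde * M * W_tilde / 4 - eps_hji

# Exact rational sample points, so any mismatch is a genuine algebra error.
samples = [F(-2), F(1, 3), F(5, 2)]
mismatches = [
    p for p in product(samples, repeat=7)
    if h_hat_direct(*p) != h_hat_a5(*p)
]
print(len(mismatches))
```

Agreement at these points is a spot check under the stated assumption about (26), not a proof of the matrix identity.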
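Rewrite $\sigma$ as

\[
\begin{aligned}
\sigma &= \int_{t}^{t+T} \Phi_e\left(f(x) + g\hat{u} + k\hat{d}\right) d\tau \\
&= \int_{t}^{t+T} \left(\Phi_e f(x) - \frac{1}{2}\Phi_e (G-K) \Phi_e^{T}\hat{W}\right) d\tau \\
&= \int_{t}^{t+T} \left(\Phi_e f(x) - \frac{1}{2}\Phi_e (G-K) \Phi_e^{T}\left(W - \tilde{W}\right)\right) d\tau \\
&= \int_{t}^{t+T} \left(\Phi_e\left(f(x) + gu^{*} + kd^{*} + \frac{1}{2}(G-K)\varepsilon_e\right) + \frac{1}{2}\Phi_e (G-K) \Phi_e^{T}\tilde{W}\right) d\tau
\end{aligned} \tag{A.6}
\]

Observing (A.5) and (A.6), with $\dot{\tilde{W}} = -\dot{\hat{W}}$ and $\dot{\tilde{W}}_1 = \dot{\tilde{W}}$, the error dynamics of (27) are written as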
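\[
\begin{aligned}
\dot{\tilde{W}}_1 = -\frac{\alpha_1}{m^{2}} &\int_{t}^{t+T} \left(\Phi_e\left(f(x) + gu^{*} + kd^{*} + \frac{1}{2}(G-K)\varepsilon_e\right) + \frac{1}{2}\Phi_e (G-K) \Phi_e^{T}\tilde{W}\right) d\tau \\
\times &\int_{t}^{t+T} \left(\tilde{W}^{T}\Phi_e\left(f(x) + gu^{*} + kd^{*} + \frac{1}{2}(G-K)\varepsilon_e\right) + \frac{1}{4}\tilde{W}^{T}\Phi_e (G-K) \Phi_e^{T}\tilde{W} + \varepsilon_{HJI}\right) d\tau
\end{aligned} \tag{A.7}
\]

where $m = \sigma^{T}\sigma + 1$. Substituting (A.7) and the dynamics (13), under the laws (24) and (25), into (A.2) gives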
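\[
\begin{aligned}
\dot{L} ={}& \int_{t}^{t+T} \alpha_2 e^{T}\left(f(x) + g\hat{u} + k\hat{d}\right) d\tau \\
& - \frac{\alpha_1}{m^{2}}\left(\int_{t}^{t+T} \tilde{W}^{T}\Phi_e\left(f(x) + gu^{*} + kd^{*} + \frac{1}{2}(G-K)\varepsilon_e\right) d\tau\right)^{2} \\
& - \frac{\alpha_1}{8m^{2}}\left(\int_{t}^{t+T} \tilde{W}^{T}\Phi_e (G-K) \Phi_e^{T}\tilde{W} \, d\tau\right)^{2} \\
& - \frac{3\alpha_1}{4m^{2}}\int_{t}^{t+T} \tilde{W}^{T}\Phi_e\left(f(x) + gu^{*} + kd^{*} + \frac{1}{2}(G-K)\varepsilon_e\right) d\tau \times \int_{t}^{t+T} \tilde{W}^{T}\Phi_e (G-K) \Phi_e^{T}\tilde{W} \, d\tau \\
& - \frac{\alpha_1}{m^{2}}\int_{t}^{t+T} \tilde{W}^{T}\Phi_e\left(f(x) + gu^{*} + kd^{*} + \frac{1}{2}(G-K)\varepsilon_e\right) d\tau \int_{t}^{t+T} \varepsilon_{HJI} \, d\tau \\
& - \frac{\alpha_1}{2m^{2}}\int_{t}^{t+T} \tilde{W}^{T}\Phi_e (G-K) \Phi_e^{T}\tilde{W} \, d\tau \int_{t}^{t+T} \varepsilon_{HJI} \, d\tau
\end{aligned} \tag{A.8}
\]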
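The coefficients 1, 1/8, 3/4, 1 and 1/2 in (A.8) come from multiplying out the two integral factors of (A.7). A minimal sketch of that bookkeeping follows. Here A, B and C are taken as scalar shorthand for the integrals of W̃^T Φ_e(f(x) + gu* + kd* + ½(G − K)ε_e), of W̃^T Φ_e(G − K)Φ_e^T W̃, and of ε_HJI over [t, t + T]. That reading is an assumption made for this check, and the function names are illustrative.

```python
from fractions import Fraction as F
from itertools import product

def product_form(A, B, C):
    # W~^T * dW~/dt from (A.7), with the common factor -alpha1/m^2 removed.
    return (A + B / 2) * (A + B / 4 + C)

def expanded_form(A, B, C):
    # The five product terms of (A.8), with the same common factor removed.
    return A**2 + F(3, 4) * A * B + F(1, 8) * B**2 + A * C + F(1, 2) * B * C

# Exact rational sample points for the three shorthand integrals.
samples = [F(-3, 2), F(0), F(1, 3), F(2)]
expansion_mismatches = [
    p for p in product(samples, repeat=3)
    if product_form(*p) != expanded_form(*p)
]
print(len(expansion_mismatches))
```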
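It is straightforward to verify that, by (A.3), there exists a positive constant $\lambda_0$ such that

\[
\int_{t}^{t+T} \alpha_2 e^{T}\left(f(x) + g\hat{u} + k\hat{d}\right) d\tau \le -\int_{t}^{t+T} \alpha_2 \lambda_0 \|e\| \, d\tau \tag{A.9}
\]

Substituting $\varepsilon_{HJI}$ from (22) and (A.9) into (A.8), and completing the square with respect to $\int_{t}^{t+T} \tilde{W}^{T}\Phi_e (G-K) \Phi_e^{T}\tilde{W} \, d\tau$ and $\int_{t}^{t+T} \tilde{W}^{T}\Phi_e\left(f(x) + gu^{*} + kd^{*} + \frac{1}{2}(G-K)\varepsilon_e\right) d\tau$, one obtains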
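\[
\begin{aligned}
\dot{L} \le{}& -\alpha_2\lambda_0\int_{t}^{t+T} \|e\| \, d\tau - \frac{\alpha_1}{m^{2}}\left(\frac{A}{2} + C\right)^{2} - \frac{\alpha_1}{m^{2}}\left(\frac{B}{4} + C\right)^{2} - \frac{3\alpha_1}{m^{2}}\left(\frac{B}{8} + A\right)^{2} \\
& + \frac{9\alpha_1}{4m^{2}}A^{2} - \frac{\alpha_1}{64m^{2}}B^{2} + \frac{2\alpha_1}{m^{2}}C^{2} \\
\le{}& -\alpha_2\lambda_0\int_{t}^{t+T} \|e\| \, d\tau + \frac{9\alpha_1}{4m^{2}}A^{2} - \frac{\alpha_1}{64m^{2}}B^{2} + \frac{2\alpha_1}{m^{2}}C^{2}
\end{aligned} \tag{A.10}
\]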
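The completing-the-square step in (A.10) can be checked independently of how A, B and C are defined, because it is a polynomial identity in three scalars. The sketch below treats A, B and C as arbitrary rational numbers (their definitions fall outside this excerpt) and factors out the positive constant α_1/m². The function names are illustrative.

```python
from fractions import Fraction as F
from itertools import product

def cross_terms(A, B, C):
    # Product terms of (A.8) divided by alpha1/m^2 (note the leading minus).
    return -(A**2 + F(3, 4) * A * B + F(1, 8) * B**2 + A * C + F(1, 2) * B * C)

def negative_squares(A, B, C):
    # The three non-positive squared terms in the first line of (A.10).
    return -(A / 2 + C)**2 - (B / 4 + C)**2 - 3 * (B / 8 + A)**2

def remainder(A, B, C):
    # The terms that survive in the final bound of (A.10).
    return F(9, 4) * A**2 - F(1, 64) * B**2 + 2 * C**2

# Exact rational sample points, so equality is tested without rounding.
samples = [F(-5, 3), F(-1, 2), F(0), F(3, 7), F(4)]
points = list(product(samples, repeat=3))
identity_holds = all(
    cross_terms(*p) == negative_squares(*p) + remainder(*p) for p in points
)
bound_holds = all(cross_terms(*p) <= remainder(*p) for p in points)
print(identity_holds, bound_holds)
```

Dropping the non-positive squares is what turns the first line of (A.10) into the second.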