FACULTY OF COMPUTER SCIENCE

NGUYEN TRUNG HIEU
TRAN DINH KHANG

BACHELOR THESIS

DESIGN AND APPLICATION OF EVOLUTIONARY
REINFORCEMENT LEARNING METHODS FOR
CONTINUOUS CONTROL

BACHELOR OF SCIENCE IN COMPUTER SCIENCE

HO CHI MINH CITY, 2021
FACULTY OF COMPUTER SCIENCE

NGUYEN TRUNG HIEU - 18520750
TRAN DINH KHANG - 18520072

BACHELOR THESIS

DESIGN AND APPLICATION OF EVOLUTIONARY
REINFORCEMENT LEARNING METHODS FOR
CONTINUOUS CONTROL

BACHELOR OF SCIENCE IN COMPUTER SCIENCE

SUPERVISOR
DR. LUONG NGOC HOANG

HO CHI MINH CITY, 2021
UNIVERSITY OF INFORMATION TECHNOLOGY — Independence - Freedom - Happiness

Ho Chi Minh City, .... (date)

BACHELOR THESIS EVALUATION
(SUPERVISOR)

Thesis title:
DESIGN AND APPLICATION OF EVOLUTIONARY REINFORCEMENT
LEARNING METHODS FOR CONTINUOUS CONTROL

Students:
Nguyễn Trung Hiếu - 18520750
Trần Đình Khang - 18520072
Supervisor: Dr. Lương Ngọc Hoàng

Thesis evaluation
1. On the written report:
   Number of pages: ....        Number of chapters: ....
   Number of tables: ....       Number of figures: ....
   Number of references: ....   Products: ....
2. On the research content:

Evaluator
(Signature and full name)
UNIVERSITY OF INFORMATION TECHNOLOGY — Independence - Freedom - Happiness

Ho Chi Minh City, .... (date)

BACHELOR THESIS EVALUATION
(REVIEWER)

Thesis title:
DESIGN AND APPLICATION OF EVOLUTIONARY REINFORCEMENT
LEARNING METHODS FOR CONTINUOUS CONTROL

Students:
Nguyễn Trung Hiếu - 18520750
Trần Đình Khang - 18520072
Reviewer:

Thesis evaluation
1. On the written report:
   Number of pages: ....        Number of chapters: ....
   Number of tables: ....       Number of figures: ....
   Number of references: ....   Products: ....
2. On the research content:
3. On the application program:
4. On the students' working attitude:

Evaluator
(Signature and full name)
DETAILED THESIS PROPOSAL

THESIS TITLE: DESIGN AND APPLICATION OF EVOLUTIONARY REINFORCEMENT
LEARNING METHODS FOR CONTINUOUS CONTROL

Supervisor: Dr. Lương Ngọc Hoàng
Duration: from 01/08/2021 to 30/12/2021
Students:
Nguyễn Trung Hiếu - 18520750
Trần Đình Khang - 18520072

Thesis content:

A. Problem description: A control problem is the problem of designing a controller so that an AI agent performs appropriate actions in different situations in order to achieve the best possible final outcome. The agent typically has to accomplish a desired task effectively by maximizing a reward function. Unlike supervised learning, the AI agent is not told in advance which action it should take in each state (there is no training dataset); it has to learn this information by interacting with the environment. The agent must try out different actions, and the environment returns signals indicating the new state of the agent after each action is performed.

The controller of an AI agent can be designed as a neural network. Such neural networks are typically trained with reinforcement learning (RL) algorithms. However, RL algorithms have some drawbacks: their performance degrades in noisy environments, and they are strongly affected by the settings of their hyperparameters. Recently, several state-of-the-art studies have experimented with combining evolutionary computation (EC) methods with reinforcement learning to mitigate these drawbacks.

B. Research objects and scope: This thesis studies evolutionary reinforcement learning (ERL) methods for training the neural networks of AI agents in control problems. The thesis focuses on the class of control problems with continuous action spaces.

C. Objectives:
+ Study evolutionary computation (EC), reinforcement learning (RL), and evolutionary reinforcement learning (ERL).
+ Design and experiment with several methods for improving the performance of current ERL methods.
D. Methodology:
Task 1: Survey the family of evolutionary reinforcement learning algorithms for control problems with continuous action spaces.
Approach: Implement and run experiments comparing recently published evolutionary reinforcement learning methods. Use the tasks in the MuJoCo toolkit to evaluate performance. Analyze the strengths and weaknesses of these methods.
Task 2: Propose methods for improving algorithm performance.
Approach: Based on the results of Task 1, propose methods for improving the performance of evolutionary reinforcement learning algorithms.

E. Expected results:
* A report analyzing and comparing current evolutionary reinforcement learning methods.
* A report and experimental results for the proposed performance-improvement methods.

Timeline:
* Task 1: August-September 2021
* Task 2: October-November 2021
* Compiling the results and writing the thesis: November-December 2021

Ho Chi Minh City, December 30, 2021
Supervisor's confirmation (signature and full name): Lương Ngọc Hoàng
Students (signature and full name): Nguyễn Trung Hiếu, Trần Đình Khang
Acknowledgments

First and foremost, we would like to give special thanks to our supervisor, Dr. Luong Ngoc Hoang, who taught, guided, and worked with us throughout our thesis. He taught us invaluable lessons in research methodology, evolutionary algorithms, and machine learning. He also offered insightful discussions whenever we needed them, whether about a problem in our research projects or just our spontaneous curiosity about a lecture. Besides, he provided us with considerable encouragement, advice, and, most importantly, computational power when we faced overwhelming problems in our thesis. Again, our greatest gratitude to Dr. Luong Ngoc Hoang. It is our honour to work with him.

We would like to thank our talented friends in KHTN2018 for the time we studied together and motivated each other to constantly move forward.

Last but not least, we would like to thank UIT for providing such an outstanding environment for study and research. This contributed greatly to the completion of our thesis.
Contents

1 Introduction
  1.1 Introduction
  1.2 Continuous control problem
  1.3 Applications
  1.4 Challenges and open problems
  1.5 Our contributions
  1.6 Structure of thesis
2 Background
  2.1 Policy Search
      2.1.1 Stochastic policy and deterministic policy
  2.2 Policy gradient methods
      2.2.1 Policy Gradient Theorem
      2.2.2 Deterministic Policy Gradient (DPG) Theorem [20]
      2.2.3 Actor-Critic architecture
      2.2.4 On-policy algorithms vs off-policy algorithms
      2.2.5 Twin Delayed Deep Deterministic Policy Gradient (TD3)
      2.2.6 Soft Actor-Critic
  2.3 Evolutionary Algorithms (EA) and Cross Entropy Method (CEM)
      2.3.1 Evolutionary Algorithms (EA)
      2.3.2 Cross Entropy Method (CEM)
  2.4 CEM-RL
3 Proposing Methods
  3.1 eTD3
  3.2 CEM-SAC
      3.2.1 Enhancing SAC with CEM
  3.3 Comparison with previous works
4 Experiments
  4.1 MuJoCo Continuous Control Benchmark
  4.2 Evaluation Protocol
  4.3 Implementation Details
      4.3.1 Proposed methods
      4.3.2 Baseline algorithms
  4.4 Results
      4.4.1 eTD3 vs TD3
      4.4.2 CEM-SAC vs SAC
  4.5 Discussion
5 Conclusion
  5.1 Summary
  5.2 Limitations
  5.3 Future Directions
Bibliography
List of Figures

1.1 Markov Decision Process (MDP). The agent interacts with the environment.
1.2 A screenshot of the Humanoid in the MuJoCo physics simulation. In Humanoid, we have to teach our agent to stand up and move forward as fast as possible.
2.1 A parameterised policy function.
2.2 Actor-Critic architecture.
2.3 A typical EA algorithm.
2.4 Illustration of CEM-RL.
3.1 Schema of eTD3.
3.2 Illustration of the CEM-SAC algorithm.
4.1 MuJoCo locomotion tasks: Ant-v3, HalfCheetah-v3, Walker2d-v3, Hopper-v3, Humanoid-v3, HumanoidStandup-v2.
4.2 Average performance results on MuJoCo-v3 environments of eTD3, CEM-TD3, and TD3 over 10 independent runs.
4.3 Average performance results on MuJoCo-v3 environments of CEM-SAC, CEM-TD3, and SAC over 10 independent runs.
4.4 10 runs of CEM-SAC and 10 runs of SAC. CEM-SAC shows more stability than SAC over the training session, with much fewer pitfalls.
List of Tables

4.1 Network architecture.
4.2 Hyperparameter values.
4.3 Final performance of eTD3, CEM-TD3, and TD3.
4.4 Results of statistical tests.
4.5 Final performance of CEM-SAC, CEM-TD3, and SAC.
4.6 Student t-tests taken across 10 executions of CEM-SAC, SAC, and CEM-TD3.
4.7 Final performance of CEM-SAC, CEM-TD3, and SAC (version 2).
List of Abbreviations

DRL   Deep Reinforcement Learning
EA    Evolutionary Algorithm
EC    Evolutionary Computation
MDP   Markov Decision Process
CEM   Cross Entropy Method
TD3   Twin Delayed Deep Deterministic Policy Gradient
SAC   Soft Actor-Critic
DDPG  Deep Deterministic Policy Gradient
ES    Evolution Strategies
ERL   Evolutionary Reinforcement Learning
Abstract

In this thesis, we enhance the performance of twin delayed deep deterministic policy gradient (TD3) and soft actor-critic (SAC), i.e., state-of-the-art off-policy policy gradient algorithms, by coupling them with the cross-entropy method (CEM), i.e., an estimation-of-distribution algorithm. The hybrid approaches, eTD3 and CEM-SAC, exhibit both the efficiency of policy gradient algorithms and the stability of CEM in training the policy neural networks of reinforcement learning agents for solving control problems.

Our work extends the evolutionary reinforcement learning (ERL) line of research on integrating the robustness of population-based stochastic black-box optimization, which typically assumes little to no problem-specific knowledge, into the training process of policy gradient algorithms, which exploit the sequential decision-making nature of the problem for efficient gradient estimation. Experimental comparisons with the baselines TD3, SAC, and CEM-TD3, a recently-introduced ERL method that combines CEM and the TD3 algorithm, on a wide range of control tasks in the MuJoCo benchmark confirm the enhanced performance of our proposed CEM-SAC.
1.1 Introduction

Reinforcement learning (RL) is a machine learning paradigm for training artificial intelligence (AI) agents to accomplish a certain task when there does not exist a dataset of training examples with corresponding ground-truth labels. It is not straightforward, and might not even be realizable, to effectively construct such a dataset in many control problems, such as robotics, inventory management, and real-time bidding. Designated tasks often involve the optimization of a reward/objective function. RL algorithms train the agents via an iterative exploration-exploitation mechanism [13]. At each iteration, experiences, often in the form of (state, action, reward, next state), are obtained by interacting directly with the environment. These collected training signals are then employed to favorably adjust the trial-and-error interactions of the agents at subsequent iterations. In recent years, couplings with deep neural networks (DNNs) and other deep learning (DL) techniques have enabled RL to achieve successful applications in a wide range of complicated problems with high-dimensional state spaces and action spaces. Policy neural networks, which recommend which action agents should perform given each input state, are often trained by a class of deep reinforcement learning (DRL) algorithms called policy gradient methods. Stochastic gradients of reward functions with respect to the policy parameters can be used to update the network weights at each gradient ascent iteration.
Deep reinforcement learning (DRL) algorithms, however, still suffer from some essential obstacles: credit assignment with sparse/delayed reward signals, premature convergence due to difficulties in maintaining meaningfully diverse exploration, and sensitivity to hyperparameter settings [9], [14]. Salimans et al. proposed to treat the training of an agent's policy network as a black-box optimization problem, which can then be solved by evolution strategies (ES), in particular natural evolution strategies [26], with competitive results compared to state-of-the-art DRL algorithms. Notable advantages of evolutionary computation (EC) as a scalable alternative to RL have been reported: EC methods are typically robust, and competitive results can be obtained with the same fixed hyperparameters in multiple environments, rather than having to perform intensive problem-specific hyperparameter tuning.

However, zeroth-order optimization techniques like (classic) EC methods might have slow convergence speed and exhibit sample inefficiency when the problem dimensionality is large [11]. It is the case that DRL agents' policy neural networks often have a huge number of parameters that need to be trained. Recent studies have been made to combine the strengths of both DRL and EC methods to efficiently and reliably train policy networks for solving control problems [9], [14].
FIGURE 1.1: Markov Decision Process (MDP). The agent interacts with the environment: it sends an action vector a ~ π_θ(s) and receives the reward r(s, a) and the next state s'.
1.2 Continuous control problem

Optimal control is the problem of designing a controller to optimize the desired behaviour of a dynamical system over time. A continuous control problem is an optimal control problem in which the controller outputs control signals that are continuous values, based on the received state signals. In the modern reinforcement learning literature, the continuous control problem is usually formulated as discrete-time stochastic control, known as a Markov Decision Process (MDP).

These problems involve one or more intelligent agents making decisions in an environment with uncertain elements. Each agent's behaviour is represented by its policy π, which should be searched for so as to maximize the expected return from the environment. Most of the time, a dynamic model of the environment, i.e., a function that describes the transitions between agent states, is extremely complex to build and is thus treated as a black-box function.

In many approaches, the control problem can be formulated as a policy search problem, in which a policy function π(a|s) describes the agent's behaviour. We want to find the policy that maximizes some objective function representing the type of behaviour we want our agent to learn. In our thesis, we parameterize the policy function using an artificial neural network and adopt a machine learning approach to automatically find our control policy (more details in Chapter 2).
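To make the interaction loop concrete, below is a minimal sketch of an agent acting in a MuJoCo task under the classic Gym API; the HalfCheetah-v3 environment is one of the benchmarks used later in this thesis, while the random controller is only a placeholder for a learned policy.

import gym  # classic Gym API, as typically used with the MuJoCo-v3 tasks

# Requires the MuJoCo physics engine to be installed alongside gym.
env = gym.make("HalfCheetah-v3")

obs = env.reset()
episode_return, done = 0.0, False
while not done:
    # A trained controller would map obs -> action; here we sample a random
    # action from the continuous action space as a stand-in.
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    episode_return += reward

print("Episode return:", episode_return)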
1.3 Applications

• Gaming: Artificial intelligence is currently adopted by the gaming industry to improve the player experience. The most relevant application is the development of adaptive and responsive non-player characters (NPCs). Adopting a policy search method would reduce the need to hard-code certain NPC behaviours (walking, running, sprinting, interacting with the human player).
1.4 Challenges and open problems

Our thesis focuses on using the framework suggested by model-free reinforcement learning to teach intelligent agents tacit knowledge such as walking and running. The term 'model-free' means that our intelligent agents do not have access to a dynamic model of the environment, which would be used to predict future states based on the current action and current state. In other words, the agents are not fed any physical knowledge about the environment, which is also non-trivial to construct for such complex tasks, other than the state vector containing their own joint Cartesian coordinates and angles. The agents have to collect and process data from the environment by themselves, then use the data to figure out the physical rules of their environment and learn to perform a set of tasks such as standing up, walking, or running.
FIGURE 1.2: A screenshot of the Humanoid in the MuJoCo physics simulation. In Humanoid, we have to teach our agent to stand up and move forward as fast as possible.
While applying such a framework to train a physical robot is highly tempting, current methods have not achieved a sample efficiency that is acceptable in real-world settings. For example, no research institution would risk breaking their expensive robots by allowing them to try and fail a million times. As a result, current research often needs to carefully experiment with the algorithms to investigate their sample efficiency in simulation environments before rolling them out in real-world applications. Our thesis is also conducted in this direction.

Compared to classic approaches in continuous control, parameterizing a controller with a deep neural network gives rise to optimization problems with extremely high dimensionalities. Naively applying modern EC methods would certainly cause memory issues or incur exceedingly long running times. Furthermore, the noise created by stochastic elements in some control problems hinders the convergence of optimization algorithms. Effectively exploring in noisy environments while maintaining an acceptable sample efficiency is a non-trivial challenge.
1.5 Our contributions

Our thesis extends the research line of combining evolutionary computation and deep reinforcement learning for continuous control problems:

• We survey and analyze current state-of-the-art algorithms in evolutionary deep reinforcement learning.
• We present our proposed combining mechanism, which is an improvement upon the mechanism proposed by Pourchot et al.
• We apply our proposed mechanism to a new combination of the Cross Entropy Method and Soft Actor-Critic.
• We conduct experiments on the MuJoCo benchmark to empirically demonstrate the stability and efficiency of our algorithms.
1.6 Structure of thesis

The remainder of this thesis is organized as follows.

• Chapter 2 briefly reviews the background knowledge related to our work, including the policy search problem and the gradient-based and gradient-free approaches to policy search.
• Chapter 3 introduces our proposed algorithms, eTD3 and CEM-SAC. We also provide conceptual comparisons between our algorithms and related algorithms.
• Chapter 4 presents the experiments conducted in our study. Based on the experiment results, we draw comparisons to baseline methods and discuss possible extensions to our work.
• Chapter 5 concludes our study and suggests potential directions for our future work.
In this chapter, we revise the most essential knowledge behind our work. In the first section, we explain policy and policy search in the reinforcement learning literature. The remaining sections discuss gradient-based approaches (i.e., policy gradient methods) and gradient-free approaches (i.e., evolutionary computation).
2.1 Policy Search

The core of an agent is its policy (or controller) π, which dictates which action a_t ∈ A to take given a state (or an observation) s_t ∈ S at each time step t. An instant reward r_t is fed back to the agent, and the environment transitions to the next state s_{t+1}. The sequence τ = (s_0, a_0, r_0, s_1, ..., s_t, a_t, r_t, s_{t+1}, ...) forms a trajectory of the agent through the state space up to a horizon t = H. The return of a trajectory is G(τ) = Σ_{t=0}^{H} γ^t r_t, and the return after time step t can be obtained as G_t = Σ_{k=t}^{H} γ^{k-t} r_k, where γ ∈ (0, 1] is a discount factor [4]. Because there typically exists certain randomness in the environment (e.g., transitions and rewards follow some probability distribution Pr(s_{t+1}, r_t | s_t, a_t)), and the agent can have a stochastic policy (i.e., a random distribution over the action space conditioned on the state space, π(a_t|s_t) = Pr(a_t|s_t)), the agent's trajectories differ from run to run. RL aims to train the agent to obtain the maximum expected return J(π) = E_{τ~π}[G(τ)].
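To make the return definitions concrete, the short sketch below computes G(τ) and the returns-to-go G_t for a finite list of rewards; the function names and reward values are our own illustrative choices.

# Minimal sketch: discounted returns of a finite trajectory (illustrative values only).
def trajectory_return(rewards, gamma=0.99):
    # G(tau) = sum_{t=0}^{H} gamma^t * r_t
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def returns_to_go(rewards, gamma=0.99):
    # G_t = sum_{k=t}^{H} gamma^(k-t) * r_k, computed backwards in a single pass.
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

rewards = [1.0, 0.0, 0.5, 2.0]        # r_0, ..., r_H (made-up)
print(trajectory_return(rewards))     # G(tau)
print(returns_to_go(rewards))         # [G_0, G_1, G_2, G_3]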
FIGURE 2.1: A parameterised policy function.
The agent's policy can be parameterized by a parameter vector θ, and in DRL, θ would be the weight vector representing a deep neural network π_θ. The training process in RL is then equivalent to employing a search algorithm to find a policy parameter vector θ* yielding an optimal policy π_{θ*} that maximizes the expected return:

\theta^* = \arg\max_\theta J(\pi_\theta) = \arg\max_\theta \mathbb{E}_{\tau \sim \pi_\theta}[G(\tau)] = \arg\max_\theta \int_\tau G(\tau) \Pr(\tau \mid \pi_\theta) \, d\tau \quad (2.1)

where the expectation is computed by sampling, and Pr(τ|π_θ) indicates the probability of a trajectory τ given a policy π_θ.
In reinforcement learning, alongside the policy, there are two other important definitions: the value function V^π(s) and the state-action value function Q^π(s,a).

• The value function evaluates the value of a state given that the agent strictly follows policy π. The definition of the value function is given in Eq. (2.2).
• The state-action value function evaluates the quality of a state-action pair given that the agent strictly follows policy π. The definition of the state-action value function is given in Eq. (2.3).

V^\pi(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}\Big[\sum_{k=t}^{\infty} \gamma^{k-t} r_k \,\Big|\, S_t = s\Big] \quad (2.2)

Q^\pi(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a] = \mathbb{E}\Big[\sum_{k=t}^{\infty} \gamma^{k-t} r_k \,\Big|\, S_t = s, A_t = a\Big] \quad (2.3)
2.1.1 Stochastic policy and deterministic policy

In the DRL literature, there exist two types of policy: the deterministic policy π(s) and the stochastic policy π(a|s).

• A deterministic policy π(s) is a function that receives a state vector and outputs an action vector. Algorithms that use this type of policy include Deterministic Policy Gradient [20], Deep Deterministic Policy Gradient, and Twin Delayed DDPG [5].
• A stochastic policy π(a|s) is a distribution over the action space conditioned on the state space. Algorithms that use this type of policy include Stochastic Actor-Critic [3] and Soft Actor-Critic (SAC) [6]. A minimal sketch contrasting the two parameterizations is given after this list.
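The following PyTorch sketch contrasts the two policy types; the layer sizes, the plain Gaussian head, and the dummy dimensions are our own illustrative choices rather than the exact architectures used in this thesis.

import torch
import torch.nn as nn

class DeterministicPolicy(nn.Module):
    # pi(s): maps a state vector directly to an action vector (as in DPG/DDPG/TD3).
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, action_dim), nn.Tanh())
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class GaussianPolicy(nn.Module):
    # pi(a|s): returns a distribution over actions conditioned on the state (as in SAC).
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.mean = nn.Linear(64, action_dim)
        self.log_std = nn.Linear(64, action_dim)

    def forward(self, state):
        h = self.body(state)
        std = self.log_std(h).clamp(-5.0, 2.0).exp()
        return torch.distributions.Normal(self.mean(h), std)

s = torch.randn(1, 17)                       # a dummy state vector
a_det = DeterministicPolicy(17, 6)(s)        # an action vector
a_sto = GaussianPolicy(17, 6)(s).sample()    # a sampled action vector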
Next, we describe two popular families of policy search algorithms, policy gradient and evolutionary computation.
2.2 Policy gradient methods

Policy gradient methods are a family of RL algorithms that rely on computing the gradient of the policy objective, ∇_θ J(π_θ), to perform policy improvement. In this section, we first summarise the works of Williams [27], Sutton et al., and Silver et al. [20], which propose feasible gradient computation methods for modern model-free DRL algorithms.
2.2.1 Policy Gradient Theorem

Williams [27] proposes an approach to compute the gradient of a policy function in the MDP formulation. In policy search, the objective is to find the policy that maximizes the expected reward obtained through a decision sequence τ, i.e., the objective J(π_θ) defined in Eq. (2.1). To derive a practical formula for its gradient, we first need to expand Pr(τ|π_θ). Using the chain rule of probability, we obtain

\Pr(\tau \mid \pi_\theta) = \rho_0(s_0) \prod_{t=0}^{H} \Pr(s_{t+1} \mid s_t, a_t) \, \pi_\theta(a_t \mid s_t) \quad (2.5)

where ρ_0(s_0) is the initial state distribution and Pr(s_{t+1}|s_t, a_t) is the state transition probability. Using the product property of the logarithm, we have

\log \Pr(\tau \mid \pi_\theta) = \log \rho_0(s_0) + \sum_{t=0}^{H} \big[ \log \Pr(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t) \big] \quad (2.6)

Since neither the initial state distribution nor the transition probabilities depend on θ, the gradient of the log-probability of a trajectory reduces to

\nabla_\theta \log \Pr(\tau \mid \pi_\theta) = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \quad (2.7)

Combining this log-derivative trick with the definition of J(π_θ) yields the policy gradient theorem:
\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, Q^{\pi_\theta}(s_t, a_t) \Big] \quad (2.8)

where Q^{π_θ}(s,a) is the action-value function of policy π_θ, which indicates the expected return that can be achieved if the agent performs action a in state s and then follows policy π_θ thereafter [22]. From here, there are two approaches to perform policy improvement using the equation above:

• Monte Carlo updates: run a full trajectory and average over the collected samples.
• Temporal difference: approximate a parameterized Q^{π_θ}(s,a) and eliminate the need to run a full trajectory. This is proven to be possible using a more general policy gradient formulation [22]:

\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\big[ Q^{\pi_\theta}(s,a) \, \nabla_\theta \ln \pi_\theta(a \mid s) \big] \quad (2.9)
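As an illustration of how Eq. (2.9) becomes a training signal in code, the following is a minimal REINFORCE-style sketch in PyTorch; the use of Monte Carlo returns in place of Q^{π_θ} and the helper names are our own simplifications.

import torch

def policy_gradient_loss(policy, states, actions, returns):
    # `policy(states)` is assumed to return a torch.distributions object (e.g. the
    # GaussianPolicy sketched earlier); `returns` holds G_t (or a Q estimate) per step.
    log_probs = policy(states).log_prob(actions).sum(dim=-1)    # log pi_theta(a_t|s_t)
    # Minimizing the negative of E[ log pi * G ] performs gradient ascent on Eq. (2.9).
    return -(log_probs * returns).mean()

# Usage sketch (data would come from sampled trajectories):
# loss = policy_gradient_loss(pi, states, actions, returns)
# optimizer.zero_grad(); loss.backward(); optimizer.step()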
2.2.2 Deterministic Policy Gradient (DPG) Theorem [20]

While the policy gradient theorem offers a feasible way to compute the gradient of a stochastic policy π_θ(a|s), the DPG theorem derives the gradient of a deterministic policy π_θ(s). The DPG theorem is the basis of policy learning in many modern DRL algorithms that aim to optimize a deterministic policy, such as DPG, DDPG, and TD3.
For clarity and simplicity, we define the related variables as follows:

• π_θ(s) = π(s): the deterministic policy parameterised by θ.
• V_{π_θ}(s) = V(s): the value function of state s under the given policy.
• Q_{π_θ}(s,a) = Q(s,a): the action-value function of the state-action pair (s,a).
• S: the state space.
• γ: the discount factor.
• P(s'|s,a): the probability of transitioning from state s to s' by taking action a; P(s → s', t, π_θ): the probability of transitioning from s to s' in t steps under policy π_θ.
• J(θ) = V(s_0): the objective function in an episodic MDP is the value function of the first state of an episode, following policy π.

We have:
\nabla_\theta V(s)
  = \nabla_\theta Q(s, \pi(s))     (V(s) = Q(s, \pi(s)) for a deterministic policy)
  = \nabla_\theta \big( r(s, \pi(s)) + \mathbb{E}[\gamma V(s')] \big)     (definition of Q(s,a))
  = \nabla_\theta r(s, \pi(s)) + \int_{\mathcal{S}} \big( \gamma \nabla_\theta P(s' \mid s, \pi(s)) V(s') + \gamma P(s' \mid s, \pi(s)) \nabla_\theta V(s') \big) ds'     (product rule)
  = \nabla_\theta \pi(s) \nabla_a r(s,a)\big|_{a=\pi(s)} + \int_{\mathcal{S}} \big( \gamma \nabla_\theta \pi(s) \nabla_a P(s' \mid s, a)\big|_{a=\pi(s)} V(s') + \gamma P(s' \mid s, \pi(s)) \nabla_\theta V(s') \big) ds'     (chain rule)

Noticing the common factor \nabla_\theta \pi(s) \nabla_a(\cdot)\big|_{a=\pi(s)} in the first two terms, we continue to derive the gradient of the value function:

\nabla_\theta V(s)
  = \nabla_\theta \pi(s) \nabla_a \Big( r(s,a) + \int_{\mathcal{S}} \gamma P(s' \mid s, a) V(s') \, ds' \Big)\Big|_{a=\pi(s)} + \int_{\mathcal{S}} \gamma P(s' \mid s, \pi(s)) \nabla_\theta V(s') \, ds'
  = \nabla_\theta \pi(s) \nabla_a Q(s,a)\big|_{a=\pi(s)} + \int_{\mathcal{S}} \gamma P(s' \mid s, \pi(s)) \nabla_\theta V(s') \, ds'     (definition of Q(s,a))

If we keep unrolling \nabla_\theta V(s') in the above equation, we can rewrite it in the following form:

\nabla_\theta V(s) = \int_{\mathcal{S}} \sum_{t=0}^{\infty} \gamma^t P(s \to s', t, \pi_\theta) \, \nabla_\theta \pi(s') \, \nabla_a Q(s', a)\big|_{a=\pi(s')} \, ds' \quad (2.10)

Plugging Eq. (2.10) into \nabla_\theta J(\pi_\theta), we have:

\nabla_\theta J(\pi_\theta)
  = \int_{\mathcal{S}} p_1(s) \nabla_\theta V(s) \, ds     (p_1(s), the initial state distribution, does not depend on \theta)
  = \int_{\mathcal{S}} p_1(s) \int_{\mathcal{S}} \sum_{t=0}^{\infty} \gamma^t P(s \to s', t, \pi_\theta) \, \nabla_\theta \pi(s') \, \nabla_a Q(s', a)\big|_{a=\pi(s')} \, ds' \, ds
  = \int_{\mathcal{S}} \rho^{\pi}(s) \, \nabla_\theta \pi(s) \, \nabla_a Q(s,a)\big|_{a=\pi(s)} \, ds
  = \mathbb{E}_{s \sim \rho^{\pi}}\big[ \nabla_\theta \pi(s) \, \nabla_a Q(s,a)\big|_{a=\pi(s)} \big]

where \rho^{\pi}(s) denotes the discounted state distribution induced by \pi.
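In practice, this deterministic policy gradient is implemented by simply backpropagating through a critic network; a minimal PyTorch sketch with hypothetical actor and critic modules follows.

import torch

def dpg_actor_loss(actor, critic, states):
    # Maximizing E_s[ Q(s, pi_theta(s)) ] follows the deterministic policy gradient:
    # autograd applies the chain rule grad_theta pi(s) * grad_a Q(s,a)|_{a=pi(s)} for us.
    actions = actor(states)
    return -critic(states, actions).mean()

# actor_optimizer.zero_grad()
# dpg_actor_loss(actor, critic, batch_states).backward()
# actor_optimizer.step()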
2.2.3 Actor-Critic architecture

In model-free DRL, we can choose to improve the agent's policy by directly optimizing a parameterized policy or by learning a value function [12], which is used to estimate action values in given states (i.e., from which a policy can be inferred). However, it can be useful to learn both a value function Q_φ and a policy function π_θ simultaneously. The policy function π_θ, or actor, generates an action a from a given state s, then executes a to obtain the next state s', a reward value r, and a termination indicator d ∈ {0, 1}. A tuple (s, a, r, s', d) is called an experience, similar to a data point, and is used to compute gradient updates for the value function, or critic. The critic considers the state and the action chosen by the actor, and then computes a Q-value that is used to update the actor's parameter vector θ. This entire process can be summarised by the following equations:

\theta \leftarrow \theta + \alpha_\theta \, Q_\phi(s,a) \, \nabla_\theta \ln \pi_\theta(a \mid s) \quad (2.11)

\delta_t = r_t + \gamma Q_\phi(s', a') - Q_\phi(s, a) \quad (2.12)

\phi \leftarrow \phi + \alpha_\phi \, \delta_t \, \nabla_\phi Q_\phi(s, a) \quad (2.13)
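A minimal sketch of one actor-critic update in the spirit of Eqs. (2.11)-(2.13); the module and optimizer names are hypothetical, rewards and done flags are assumed to be 1-D tensors, and bootstrapping with a' sampled from the current actor is an illustrative choice rather than the exact scheme of any specific algorithm.

import torch

def actor_critic_step(actor, critic, batch, actor_opt, critic_opt, gamma=0.99):
    # `actor(s)` returns a torch.distributions object, `critic(s, a)` a Q-value of shape (B, 1).
    s, a, r, s_next, done = batch

    # Critic update (Eqs. 2.12-2.13): TD error  delta = r + gamma * Q(s', a') - Q(s, a).
    with torch.no_grad():
        a_next = actor(s_next).sample()
        target = r + gamma * (1.0 - done) * critic(s_next, a_next).squeeze(-1)
    critic_loss = (critic(s, a).squeeze(-1) - target).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update (Eq. 2.11): ascend Q(s, a) * grad_theta log pi_theta(a|s).
    log_prob = actor(s).log_prob(a).sum(dim=-1)
    actor_loss = -(critic(s, a).squeeze(-1).detach() * log_prob).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()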
The Actor-Critic architecture has several attractive properties [19]:

• Accuracy and scalability: Actor-Critic methods represent their policy function π_θ and value function Q_φ as non-linear deep neural networks. As a result, DRL offers an approach to tackle problems with extremely large state spaces and action spaces. This is considered one of the main strengths of current DRL algorithms compared to traditional methods in robotic control.
• Stability: most algorithms that implement the actor-critic architecture make use of one or more target critics alongside the main critic(s). A target critic serves as a lagged version of the main critic and gets updated less frequently in order to stabilize the algorithm (a minimal sketch of such a lagged update is given below). As for why a target critic can stabilize a DRL algorithm, we refer to the work of Lillicrap et al.

FIGURE 2.2: Actor-Critic architecture.
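The 'lagged' target critic mentioned above is typically maintained by Polyak averaging of the parameters; a minimal sketch follows (the coefficient 0.005 is a common default, not necessarily the value used in our experiments).

import torch

@torch.no_grad()
def polyak_update(target_net, main_net, tau=0.005):
    # phi_target <- (1 - tau) * phi_target + tau * phi_main: the target slowly tracks the main network.
    for p_target, p_main in zip(target_net.parameters(), main_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p_main)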
2.2.4 On-policy algorithms vs off-policy algorithms

Current algorithms in model-free reinforcement learning are divided into on-policy algorithms and off-policy algorithms. The difference between the two classes of algorithms is characterised by the way they process the collected data.
On-policy algorithms. Algorithms in this family rely on Monte Carlo sampling, which collects a large number of samples, to evaluate the current policy. Optimizing the policy network is then a common regression problem: given the samples (where state-action pairs are examples and reward values are labels) in mini-batches, find the optimal network parameters that maximize the expected reward function. Afterwards, those samples are discarded because they are no longer related to the newly-updated policy.

Off-policy algorithms. Algorithms in this family assume that any sample generated by any policy function (i.e., a past policy function) can be used to optimize the current policy function. As a result, a data storage structure, known as a replay buffer, is adopted to store the samples that have already been used in optimizing the policy function. Mini-batches of samples from the replay buffer are then re-used to update the current policy function. Thanks to the use of the replay buffer, off-policy methods often achieve higher sample efficiency than on-policy methods.
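A minimal replay buffer sketch for storing and re-sampling experiences (s, a, r, s', d); a fixed-capacity deque with uniform sampling is one simple choice, not necessarily the exact data structure used by the implementations discussed in this thesis.

import random
from collections import deque

class ReplayBuffer:
    # Stores experiences (s, a, r, s_next, done) and samples uniform mini-batches.
    def __init__(self, capacity=1_000_000):
        self.storage = deque(maxlen=capacity)    # oldest experiences are evicted first

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size=256):
        batch = random.sample(self.storage, batch_size)
        # Transpose the list of tuples into (states, actions, rewards, next_states, dones).
        return tuple(map(list, zip(*batch)))

    def __len__(self):
        return len(self.storage)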
The powerful properties of the actor-critic architecture combined with the sample-reuse mechanism of off-policy methods are the main ingredients of current state-of-the-art DRL algorithms in the continuous control domain.

In our thesis, we experiment with TD3 and SAC, which are two popular off-policy algorithms that use the actor-critic architecture. The main difference between the two algorithms is the nature of their actor: TD3 makes use of a deterministic actor π_θ(s), while SAC prefers a stochastic actor π_θ(a|s). As the actor is the equivalent of the policy, the distinction between the two types is given in Section 2.1.1.
2.2.5 Twin Delayed Deep Deterministic Policy Gradient (TD3)
TD3 [5] is a model-free off-policy DRL algorithm in which an actor (or policy network/controller) and two critics (or value functions) are learned at the same time. Every time the agent makes a decision in the environment, the experience is stored in an off-policy replay buffer. These data are then sampled into mini-batches and used to update the critics with a modified Bellman equation:

y(r, s', d) = r + \gamma (1 - d) \min_{i=1,2} Q_{\phi_i'}(s', a')

a' = \mathrm{clip}\big( \pi_{\theta'}(s') + \epsilon', \, a_{\min}, \, a_{\max} \big)

\epsilon' \sim \mathrm{clip}\big( \mathcal{N}(0, \sigma'), \, -c, \, c \big)

where the action a' is sampled from the target actor network π_{θ'}(s') with some target policy noise ε'. The clip operations ensure that actions and noises stay within the allowable ranges [a_min, a_max] and [−c, c], respectively. The parameter vector θ' of the target actor π_{θ'} is updated based on θ after every certain number of time steps. The critics are used to update the actor using the DPG theorem equation:

\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s}\big[ \nabla_\theta \pi(s) \, \nabla_a Q(s,a)\big|_{a=\pi(s)} \big] \quad (2.14)
                            = \mathbb{E}_{s}\big[ \nabla_\theta Q(s, \pi_\theta(s)) \big] \quad (2.15)

Algorithm 1 summarizes the pseudo-code of TD3, adapted from [1].
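The target computation above can be written compactly in PyTorch; the following sketch uses hypothetical target-network modules, and the noise and discount values are commonly used defaults rather than necessarily those of our experiments. The tensors r and done are assumed to have the same shape as the critic output.

import torch

@torch.no_grad()
def td3_target(r, s_next, done, actor_target, q1_target, q2_target,
               gamma=0.99, sigma=0.2, noise_clip=0.5, max_action=1.0):
    # Target policy smoothing: a' = clip(pi_theta'(s') + clip(eps, -c, c), a_min, a_max).
    a_det = actor_target(s_next)
    noise = (torch.randn_like(a_det) * sigma).clamp(-noise_clip, noise_clip)
    a_next = (a_det + noise).clamp(-max_action, max_action)
    # Clipped double-Q: bootstrap from the minimum of the two target critics.
    q_next = torch.min(q1_target(s_next, a_next), q2_target(s_next, a_next))
    return r + gamma * (1.0 - done) * q_next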
2.2.6 Soft Actor-Critic

Policy gradient algorithms can be implemented in an on-policy setting (i.e., experiences sampled under policy π_θ are discarded after being used to approximate ∇_θ J(π_θ)) [18], [17], which is sample-inefficient but offers higher stability, or in an off-policy setting (i.e., experiences sampled under different policies can be stored in a buffer and later used to estimate ∇_θ J(π_θ)) [24], [5], which has a higher sample efficiency but is sensitive to many hyperparameters [7]. Soft Actor-Critic (SAC) [6] is an off-policy algorithm that is built on the actor-critic framework. SAC introduces an entropy regularization term into the RL objective function, and the policy should learn to maximize the following function:

J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[ r(s_t, a_t) + \alpha \, \mathcal{H}(\pi(\cdot \mid s_t)) \big]

where α is a temperature coefficient that controls the relative importance of the entropy term H(π(·|s_t)) against the reward.
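To illustrate how the entropy term enters the update in practice, below is a minimal sketch of a SAC-style actor loss; it uses the log-probability of a reparameterized sample as the entropy estimate and a single critic for brevity, which is a simplification of the full SAC algorithm.

import torch

def sac_actor_loss(actor, critic, states, alpha=0.2):
    # `actor(states)` is assumed to return a reparameterizable torch.distributions object.
    dist = actor(states)
    actions = dist.rsample()                         # reparameterization trick
    log_prob = dist.log_prob(actions).sum(dim=-1)
    # Maximize E[ Q(s, a) - alpha * log pi(a|s) ] (reward plus an entropy bonus),
    # i.e. minimize its negation.
    return (alpha * log_prob - critic(states, actions).squeeze(-1)).mean()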
Algorithm 1 Twin Delayed DDPG (TD3)
 1: Input: θ, φ_1, φ_2, replay buffer D,
 2:        max simulation steps H, the number of updates per step U,
 3:        actor update frequency freq
 4: Initialize: π_θ, Q_{φ_1}, Q_{φ_2}, θ' ← θ, φ'_{i=1,2} ← φ_{i=1,2}
 5: for t = 1 to H do
 6:     Execute action a_t ~ clip(π_θ(s_t) + ε, a_min, a_max),
 7:         where ε ~ N(0, σ) is some exploration noise.
 8:     Observe next state s_{t+1} and reward r_t.
 9:     for j = 0 to U − 1 do
10:         Sample a batch of data B from D and compute y(r, s', d).
11:         Run 1 step of gradient descent on the critics:
                ∇_{φ_i} (1/|B|) Σ_{(s,a,r,s',d) ∈ B} ( Q_{φ_i}(s,a) − y(r,s',d) )²   for i = 1, 2
12:         if j mod freq = 0 then
13:             Run 1 step of gradient ascent on the actor