FACULTY OF COMPUTER SCIENCE

NGUYEN TRUNG HIEU
TRAN DINH KHANG

BACHELOR THESIS

DESIGN AND APPLICATION OF EVOLUTIONARY
REINFORCEMENT LEARNING METHODS FOR
CONTINUOUS CONTROL

BACHELOR OF SCIENCE IN COMPUTER SCIENCE

HO CHI MINH CITY, 2021
FACULTY OF COMPUTER SCIENCE

NGUYEN TRUNG HIEU - 18520750
TRAN DINH KHANG - 18520072

BACHELOR THESIS

DESIGN AND APPLICATION OF EVOLUTIONARY
REINFORCEMENT LEARNING METHODS FOR
CONTINUOUS CONTROL

BACHELOR OF SCIENCE IN COMPUTER SCIENCE

SUPERVISOR
DR. LUONG NGOC HOANG

HO CHI MINH CITY, 2021
UNIVERSITY OF INFORMATION TECHNOLOGY — Independence - Freedom - Happiness

Ho Chi Minh City, .... (date)

BACHELOR THESIS EVALUATION
(SUPERVISOR)

Thesis title:
DESIGN AND APPLICATION OF EVOLUTIONARY REINFORCEMENT
LEARNING METHODS FOR CONTINUOUS CONTROL

Students:
Nguyễn Trung Hiếu - 18520750
Trần Đình Khang - 18520072
Supervisor: Dr. Lương Ngọc Hoàng

Thesis evaluation
1. On the written report:
   Number of pages: ....        Number of chapters: ....
   Number of tables: ....       Number of figures: ....
   Number of references: ....   Products: ....
2. On the research content:

Evaluator
(Signature and full name)
UNIVERSITY OF INFORMATION TECHNOLOGY — Independence - Freedom - Happiness

Ho Chi Minh City, .... (date)

BACHELOR THESIS EVALUATION
(REVIEWER)

Thesis title:
DESIGN AND APPLICATION OF EVOLUTIONARY REINFORCEMENT
LEARNING METHODS FOR CONTINUOUS CONTROL

Students:
Nguyễn Trung Hiếu - 18520750
Trần Đình Khang - 18520072
Reviewer:

Thesis evaluation
1. On the written report:
   Number of pages: ....        Number of chapters: ....
   Number of tables: ....       Number of figures: ....
   Number of references: ....   Products: ....
2. On the research content:
3. On the application program:
4. On the students' working attitude:

Evaluator
(Signature and full name)
DETAILED THESIS PROPOSAL

THESIS TITLE: DESIGN AND APPLICATION OF EVOLUTIONARY REINFORCEMENT
LEARNING METHODS FOR CONTINUOUS CONTROL

Supervisor: Dr. Lương Ngọc Hoàng
Duration: from 01/08/2021 to 30/12/2021
Students:
Nguyễn Trung Hiếu - 18520750
Trần Đình Khang - 18520072

Thesis content:

A. Problem description: A control problem is the problem of designing a controller so that an AI agent performs appropriate actions in different situations in order to achieve the best possible final outcome. The agent typically has to accomplish a desired task effectively by maximizing a reward function. Unlike supervised learning, the AI agent is not told in advance which action it should take in each state (there is no training dataset); it has to learn this information by interacting with the environment. The agent must try out different actions, and the environment returns signals indicating the new state of the agent after each action is performed.

The controller of an AI agent can be designed as a neural network. Such neural networks are typically trained with reinforcement learning (RL) algorithms. However, RL algorithms have some drawbacks: their performance degrades in noisy environments, and they are strongly affected by the settings of their hyperparameters. Recently, several state-of-the-art studies have experimented with combining evolutionary computation (EC) methods with reinforcement learning to mitigate these drawbacks.

B. Research objects and scope: This thesis studies evolutionary reinforcement learning (ERL) methods for training the neural networks of AI agents in control problems. The thesis focuses on the class of control problems with continuous action spaces.

C. Objectives:
+ Study evolutionary computation (EC), reinforcement learning (RL), and evolutionary reinforcement learning (ERL).
+ Design and experiment with several methods for improving the performance of current ERL methods.
D. Methodology:
Task 1: Survey the family of evolutionary reinforcement learning algorithms for control problems with continuous action spaces.
Approach: Implement and run experiments comparing recently published evolutionary reinforcement learning methods. Use the tasks in the MuJoCo toolkit to evaluate performance. Analyze the strengths and weaknesses of these methods.
Task 2: Propose methods for improving algorithm performance.
Approach: Based on the results of Task 1, propose methods for improving the performance of evolutionary reinforcement learning algorithms.

E. Expected results:
* A report analyzing and comparing current evolutionary reinforcement learning methods.
* A report and experimental results for the proposed performance-improvement methods.

Timeline:
* Task 1: August-September 2021
* Task 2: October-November 2021
* Compiling the results and writing the thesis: November-December 2021

Ho Chi Minh City, December 30, 2021
Supervisor's confirmation (signature and full name): Lương Ngọc Hoàng
Students (signature and full name): Nguyễn Trung Hiếu, Trần Đình Khang
Acknowledgments

First and foremost, we would like to give special thanks to our supervisor, Dr. Luong Ngoc Hoang, who taught, guided, and worked with us throughout our thesis. He taught us invaluable lessons in research methodology, evolutionary algorithms, and machine learning. He also offered insightful discussions whenever we needed them, whether about a problem in our research projects or just our spontaneous curiosity about a lecture. Besides, he provided us with considerable encouragement, advice, and, most importantly, computational power when we faced overwhelming problems in our thesis. Again, our greatest gratitude to Dr. Luong Ngoc Hoang. It is our honour to work with him.

We would like to thank our talented friends in KHTN2018 for the time we studied together and motivated each other to constantly move forward.

Last but not least, we would like to thank UIT for providing such an outstanding environment for study and research. This contributed greatly to the completion of our thesis.
Contents

1 Introduction
  1.1 Introduction
  1.2 Continuous control problem
  1.3 Applications
  1.4 Challenges and open problems
  1.5 Our contributions
  1.6 Structure of thesis
2 Background
  2.1 Policy Search
      2.1.1 Stochastic policy and deterministic policy
  2.2 Policy gradient methods
      2.2.1 Policy Gradient Theorem
      2.2.2 Deterministic Policy Gradient (DPG) Theorem [20]
      2.2.3 Actor-Critic architecture
      2.2.4 On-policy algorithms vs off-policy algorithms
      2.2.5 Twin Delayed Deep Deterministic Policy Gradient (TD3)
      2.2.6 Soft Actor-Critic
  2.3 Evolutionary Algorithms (EA) and Cross Entropy Method (CEM)
      2.3.1 Evolutionary Algorithms (EA)
      2.3.2 Cross Entropy Method (CEM)
  2.4 CEM-RL
3 Proposing Methods
  3.1 eTD3
  3.2 CEM-SAC
      3.2.1 Enhancing SAC with CEM
  3.3 Comparison with previous works
4 Experiments
  4.1 MuJoCo Continuous Control Benchmark
  4.2 Evaluation Protocol
  4.3 Implementation Details
      4.3.1 Proposed methods
      4.3.2 Baseline algorithms
  4.4 Results
      4.4.1 eTD3 vs TD3
      4.4.2 CEM-SAC vs SAC
  4.5 Discussion
5 Conclusion
  5.1 Summary
  5.2 Limitations
  5.3 Future Directions
Bibliography
List of Figures

1.1 Markov Decision Process (MDP). The agent interacts with the environment.
1.2 A screenshot of the Humanoid in the MuJoCo physics simulation. In Humanoid, we have to teach our agent to stand up and move forward as fast as possible.
2.1 A parameterised policy function.
2.2 Actor-Critic architecture.
2.3 A typical EA algorithm.
2.4 Illustration of CEM-RL.
3.1 Schema of eTD3.
3.2 Illustration of the CEM-SAC algorithm.
4.1 MuJoCo locomotion tasks: Ant-v3, HalfCheetah-v3, Walker2d-v3, Hopper-v3, Humanoid-v3, HumanoidStandup-v2.
4.2 Average performance results on MuJoCo-v3 environments of eTD3, CEM-TD3, and TD3 over 10 independent runs.
4.3 Average performance results on MuJoCo-v3 environments of CEM-SAC, CEM-TD3, and SAC over 10 independent runs.
4.4 10 runs of CEM-SAC and 10 runs of SAC. CEM-SAC shows more stability than SAC over the training session, with much fewer pitfalls.
List of Tables

4.1 Network architecture.
4.2 Hyperparameter values.
4.3 Final performance of eTD3, CEM-TD3, and TD3.
4.4 Results of statistical tests.
4.5 Final performance of CEM-SAC, CEM-TD3, and SAC.
4.6 Student t-tests taken across 10 executions of CEM-SAC, SAC, and CEM-TD3.
4.7 Final performance of CEM-SAC, CEM-TD3, and SAC (version 2).
List of Abbreviations

DRL   Deep Reinforcement Learning
EA    Evolutionary Algorithm
EC    Evolutionary Computation
MDP   Markov Decision Process
CEM   Cross Entropy Method
TD3   Twin Delayed Deep Deterministic Policy Gradient
SAC   Soft Actor-Critic
DDPG  Deep Deterministic Policy Gradient
ES    Evolution Strategies
ERL   Evolutionary Reinforcement Learning
Abstract

In this thesis, we enhance the performance of twin delayed deep deterministic policy gradient (TD3) and soft actor-critic (SAC), i.e., state-of-the-art off-policy policy gradient algorithms, by coupling them with the cross-entropy method (CEM), i.e., an estimation-of-distribution algorithm. The hybrid approaches, eTD3 and CEM-SAC, exhibit both the efficiency of policy gradient algorithms and the stability of CEM in training the policy neural networks of reinforcement learning agents for solving control problems.

Our work extends the evolutionary reinforcement learning (ERL) line of research on integrating the robustness of population-based stochastic black-box optimization, which typically assumes little to no problem-specific knowledge, into the training process of policy gradient algorithms, which exploit the sequential decision-making nature of the problem for efficient gradient estimation. Experimental comparisons with the baselines TD3, SAC, and CEM-TD3, a recently-introduced ERL method that combines CEM and the TD3 algorithm, on a wide range of control tasks in the MuJoCo benchmark confirm the enhanced performance of our proposed CEM-SAC.
1.1 Introduction

Reinforcement learning (RL) is a machine learning paradigm for training artificial intelligence (AI) agents to accomplish a certain task when there does not exist a dataset of training examples with corresponding ground-truth labels. It is not straightforward, and might not even be realizable, to effectively construct such a dataset in many control problems, such as robotics, inventory management, and real-time bidding. Designated tasks often involve the optimization of a reward/objective function. RL algorithms train the agents via an iterative exploration-exploitation mechanism [13]. At each iteration, experiences, often in the form of (state, action, reward, next state), are obtained by interacting directly with the environment. These collected training signals are then employed to favorably adjust the trial-and-error interactions of the agents at subsequent iterations. In recent years, couplings with deep neural networks (DNNs) and other deep learning (DL) techniques have enabled RL to achieve successful applications in a wide range of complicated problems with high-dimensional state spaces and action spaces. Policy neural networks, which recommend which action agents should perform given each input state, are often trained by a class of deep reinforcement learning (DRL) algorithms called policy gradient methods. Stochastic gradients of reward functions with respect to the policy parameters can be used to update the network weights at each gradient ascent iteration.
Deep reinforcement learning (DRL) algorithms, however, still suffer from some essential obstacles: credit assignment with sparse/delayed reward signals, premature convergence due to difficulties in maintaining meaningfully diverse exploration, and sensitivity to hyperparameter settings [9], [14]. Salimans et al. proposed to treat the training of an agent's policy network as a black-box optimization problem, which can then be solved by evolution strategies (ES), in particular natural evolution strategies [26], with competitive results compared to state-of-the-art DRL algorithms. Notable advantages of evolutionary computation (EC) as a scalable alternative to RL have been reported: EC methods are typically robust, and competitive results can be obtained with the same fixed hyperparameters in multiple environments, rather than having to perform intensive problem-specific hyperparameter tuning.

However, zeroth-order optimization techniques like (classic) EC methods might have slow convergence speed and exhibit sample inefficiency when the problem dimensionality is large [11]. It is the case that DRL agents' policy neural networks often have a huge number of parameters that need to be trained. Recent studies have been made to combine the strengths of both DRL and EC methods to efficiently and reliably train policy networks for solving control problems [9], [14].
FIGURE 1.1: Markov Decision Process (MDP). The agent interacts with the environment: it sends an action vector a ~ π_θ(s) and receives the reward r(s, a) and the next state s'.
1.2 Continuous control problem

Optimal control is the problem of designing a controller to optimize the desired behaviour of a dynamical system over time. A continuous control problem is an optimal control problem in which the controller outputs control signals that are continuous values, based on the received state signals. In the modern reinforcement learning literature, the continuous control problem is usually formulated as discrete-time stochastic control, known as a Markov Decision Process (MDP).

These problems involve one or more intelligent agents making decisions in an environment with uncertain elements. Each agent's behaviour is represented by its policy π, which should be searched for so as to maximize the expected return from the environment. Most of the time, a dynamic model of the environment, i.e., a function that describes the transitions between agent states, is extremely complex to build and is thus treated as a black-box function.

In many approaches, the control problem can be formulated as a policy search problem, in which a policy function π(a|s) describes the agent's behaviour. We want to find the policy that maximizes some objective function representing the type of behaviour we want our agent to learn. In our thesis, we parameterize the policy function using an artificial neural network and adopt a machine learning approach to automatically find our control policy (more details in Chapter 2).
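To make the interaction loop concrete, below is a minimal sketch of an agent acting in a MuJoCo task under the classic Gym API; the HalfCheetah-v3 environment is one of the benchmarks used later in this thesis, while the random controller is only a placeholder for a learned policy.

import gym  # classic Gym API, as typically used with the MuJoCo-v3 tasks

# Requires the MuJoCo physics engine to be installed alongside gym.
env = gym.make("HalfCheetah-v3")

obs = env.reset()
episode_return, done = 0.0, False
while not done:
    # A trained controller would map obs -> action; here we sample a random
    # action from the continuous action space as a stand-in.
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    episode_return += reward

print("Episode return:", episode_return)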
1.3 Applications

• Gaming: Artificial intelligence is currently adopted by the gaming industry to improve the player experience. The most relevant application is the development of adaptive and responsive non-player characters (NPCs). Adopting a policy search method would reduce the need to hard-code certain NPC behaviours (walking, running, sprinting, interacting with the human player).
1.4 Challenges and open problems

Our thesis focuses on using the framework suggested by model-free reinforcement learning to teach intelligent agents tacit knowledge such as walking and running. The term 'model-free' means that our intelligent agents do not have access to a dynamic model of the environment, which would be used to predict future states based on the current action and current state. In other words, the agents are not fed any physical knowledge about the environment, which is also non-trivial to construct for such complex tasks, other than the state vector containing their own joint Cartesian coordinates and angles. The agents have to collect and process data from the environment by themselves, then use the data to figure out the physical rules of their environment and learn to perform a set of tasks such as standing up, walking, or running.
FIGURE 1.2: A screenshot of the Humanoid in the MuJoCo physics simulation. In Humanoid, we have to teach our agent to stand up and move forward as fast as possible.
While applying such a framework to train a physical robot is highly tempting, current methods have not achieved a sample efficiency that is acceptable in real-world settings. For example, no research institution would risk breaking their expensive robots by allowing them to try and fail a million times. As a result, current research often needs to carefully experiment with the algorithms to investigate their sample efficiency in simulation environments before rolling them out in real-world applications. Our thesis is also conducted in this direction.

Compared to classic approaches in continuous control, parameterizing a controller with a deep neural network gives rise to optimization problems with extremely high dimensionalities. Naively applying modern EC methods would certainly cause memory issues or incur exceedingly long running times. Furthermore, the noise created by stochastic elements in some control problems hinders the convergence of optimization algorithms. Effectively exploring in noisy environments while maintaining an acceptable sample efficiency is a non-trivial challenge.
1.5 Our contributions

Our thesis extends the research line of combining evolutionary computation and deep reinforcement learning for continuous control problems:

• We survey and analyze current state-of-the-art algorithms in evolutionary deep reinforcement learning.
• We present our proposed combining mechanism, which is an improvement upon the mechanism proposed by Pourchot et al.
• We apply our proposed mechanism to a new combination of the Cross Entropy Method and Soft Actor-Critic.
• We conduct experiments on the MuJoCo benchmark to empirically demonstrate the stability and efficiency of our algorithms.
1.6 Structure of thesis

The remainder of this thesis is organized as follows.

• Chapter 2 briefly reviews the background knowledge related to our work, including the policy search problem and the gradient-based and gradient-free approaches to policy search.
• Chapter 3 introduces our proposed algorithms, eTD3 and CEM-SAC. We also provide conceptual comparisons between our algorithms and related algorithms.
• Chapter 4 presents the experiments conducted in our study. Based on the experiment results, we draw comparisons to baseline methods and discuss possible extensions to our work.
• Chapter 5 concludes our study and suggests potential directions for our future work.
In this chapter, we revise the most essential knowledge behind our work. In the first section, we explain policy and policy search in the reinforcement learning literature. The remaining sections discuss gradient-based approaches (i.e., policy gradient methods) and gradient-free approaches (i.e., evolutionary computation).
2.1 Policy Search

The core of an agent is its policy (or controller) π, which dictates which action a_t ∈ A to take given a state (or an observation) s_t ∈ S at each time step t. An instant reward r_t is fed back to the agent, and the environment transitions to the next state s_{t+1}. The sequence τ = (s_0, a_0, r_0, s_1, ..., s_t, a_t, r_t, s_{t+1}, ...) forms a trajectory of the agent through the state space up to a horizon t = H. The return of a trajectory is G(τ) = Σ_{t=0}^{H} γ^t r_t, and the return after time step t can be obtained as G_t = Σ_{k=t}^{H} γ^{k-t} r_k, where γ ∈ (0, 1] is a discount factor [4]. Because there typically exists certain randomness in the environment (e.g., transitions and rewards follow some probability distribution Pr(s_{t+1}, r_t | s_t, a_t)), and the agent can have a stochastic policy (i.e., a random distribution over the action space conditioned on the state space, π(a_t|s_t) = Pr(a_t|s_t)), the agent's trajectories differ from run to run. RL aims to train the agent to obtain the maximum expected return J(π) = E_{τ~π}[G(τ)].
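To make the return definitions concrete, the short sketch below computes G(τ) and the returns-to-go G_t for a finite list of rewards; the function names and reward values are our own illustrative choices.

# Minimal sketch: discounted returns of a finite trajectory (illustrative values only).
def trajectory_return(rewards, gamma=0.99):
    # G(tau) = sum_{t=0}^{H} gamma^t * r_t
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def returns_to_go(rewards, gamma=0.99):
    # G_t = sum_{k=t}^{H} gamma^(k-t) * r_k, computed backwards in a single pass.
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

rewards = [1.0, 0.0, 0.5, 2.0]        # r_0, ..., r_H (made-up)
print(trajectory_return(rewards))     # G(tau)
print(returns_to_go(rewards))         # [G_0, G_1, G_2, G_3]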
FIGURE 2.1: A parameterised policy function.
The agent's policy can be parameterized by a parameter vector θ, and in DRL, θ would be the weight vector representing a deep neural network π_θ. The training process in RL is then equivalent to employing a search algorithm to find a policy parameter vector θ* yielding an optimal policy π_{θ*} that maximizes the expected return:

\theta^* = \arg\max_\theta J(\pi_\theta) = \arg\max_\theta \mathbb{E}_{\tau \sim \pi_\theta}[G(\tau)] = \arg\max_\theta \int_\tau G(\tau) \Pr(\tau \mid \pi_\theta) \, d\tau \quad (2.1)

where the expectation is computed by sampling, and Pr(τ|π_θ) indicates the probability of a trajectory τ given a policy π_θ.
In reinforcement learning, alongside the policy, there are two other important definitions: the value function V^π(s) and the state-action value function Q^π(s,a).

• The value function evaluates the value of a state given that the agent strictly follows policy π. The definition of the value function is given in Eq. (2.2).
• The state-action value function evaluates the quality of a state-action pair given that the agent strictly follows policy π. The definition of the state-action value function is given in Eq. (2.3).

V^\pi(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}\Big[\sum_{k=t}^{\infty} \gamma^{k-t} r_k \,\Big|\, S_t = s\Big] \quad (2.2)

Q^\pi(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a] = \mathbb{E}\Big[\sum_{k=t}^{\infty} \gamma^{k-t} r_k \,\Big|\, S_t = s, A_t = a\Big] \quad (2.3)
2.1.1 Stochastic policy and deterministic policy

In the DRL literature, there exist two types of policy: the deterministic policy π(s) and the stochastic policy π(a|s).

• A deterministic policy π(s) is a function that receives a state vector and outputs an action vector. Algorithms that use this type of policy include Deterministic Policy Gradient [20], Deep Deterministic Policy Gradient, and Twin Delayed DDPG [5].
• A stochastic policy π(a|s) is a distribution over the action space conditioned on the state space. Algorithms that use this type of policy include Stochastic Actor-Critic [3] and Soft Actor-Critic (SAC) [6]. A minimal sketch contrasting the two parameterizations is given after this list.
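The following PyTorch sketch contrasts the two policy types; the layer sizes, the plain Gaussian head, and the dummy dimensions are our own illustrative choices rather than the exact architectures used in this thesis.

import torch
import torch.nn as nn

class DeterministicPolicy(nn.Module):
    # pi(s): maps a state vector directly to an action vector (as in DPG/DDPG/TD3).
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, action_dim), nn.Tanh())
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class GaussianPolicy(nn.Module):
    # pi(a|s): returns a distribution over actions conditioned on the state (as in SAC).
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.mean = nn.Linear(64, action_dim)
        self.log_std = nn.Linear(64, action_dim)

    def forward(self, state):
        h = self.body(state)
        std = self.log_std(h).clamp(-5.0, 2.0).exp()
        return torch.distributions.Normal(self.mean(h), std)

s = torch.randn(1, 17)                       # a dummy state vector
a_det = DeterministicPolicy(17, 6)(s)        # an action vector
a_sto = GaussianPolicy(17, 6)(s).sample()    # a sampled action vector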
Next, we describe two popular families of policy search algorithms, policy gradient and evolutionary computation.
2.2 Policy gradient methods

Policy gradient methods are a family of RL algorithms that rely on computing the gradient of the policy objective, ∇_θ J(π_θ), to perform policy improvement. In this section, we first summarise the works of Williams [27], Sutton et al., and Silver et al. [20], which propose feasible gradient computation methods for modern model-free DRL algorithms.
2.2.1 Policy Gradient Theorem

Williams [27] proposes an approach to compute the gradient of a policy function in the MDP formulation. In policy search, the objective is to find the policy that maximizes the expected reward obtained through a decision sequence τ, i.e., the objective J(π_θ) defined in Eq. (2.1). To derive a practical formula for its gradient, we first need to expand Pr(τ|π_θ). Using the chain rule of probability, we obtain

\Pr(\tau \mid \pi_\theta) = \rho_0(s_0) \prod_{t=0}^{H} \Pr(s_{t+1} \mid s_t, a_t) \, \pi_\theta(a_t \mid s_t) \quad (2.5)

where ρ_0(s_0) is the initial state distribution and Pr(s_{t+1}|s_t, a_t) is the state transition probability. Using the product property of the logarithm, we have

\log \Pr(\tau \mid \pi_\theta) = \log \rho_0(s_0) + \sum_{t=0}^{H} \big[ \log \Pr(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t) \big] \quad (2.6)

Since neither the initial state distribution nor the transition probabilities depend on θ, the gradient of the log-probability of a trajectory reduces to

\nabla_\theta \log \Pr(\tau \mid \pi_\theta) = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \quad (2.7)

Combining this log-derivative trick with the definition of J(π_θ) yields the policy gradient theorem:
\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, Q^{\pi_\theta}(s_t, a_t) \Big] \quad (2.8)

where Q^{π_θ}(s,a) is the action-value function of policy π_θ, which indicates the expected return that can be achieved if the agent performs action a in state s and then follows policy π_θ thereafter [22]. From here, there are two approaches to perform policy improvement using the equation above:

• Monte Carlo updates: run a full trajectory and average over the collected samples.
• Temporal difference: approximate a parameterized Q^{π_θ}(s,a) and eliminate the need to run a full trajectory. This is proven to be possible using a more general policy gradient formulation [22]:

\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\big[ Q^{\pi_\theta}(s,a) \, \nabla_\theta \ln \pi_\theta(a \mid s) \big] \quad (2.9)
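As an illustration of how Eq. (2.9) becomes a training signal in code, the following is a minimal REINFORCE-style sketch in PyTorch; the use of Monte Carlo returns in place of Q^{π_θ} and the helper names are our own simplifications.

import torch

def policy_gradient_loss(policy, states, actions, returns):
    # `policy(states)` is assumed to return a torch.distributions object (e.g. the
    # GaussianPolicy sketched earlier); `returns` holds G_t (or a Q estimate) per step.
    log_probs = policy(states).log_prob(actions).sum(dim=-1)    # log pi_theta(a_t|s_t)
    # Minimizing the negative of E[ log pi * G ] performs gradient ascent on Eq. (2.9).
    return -(log_probs * returns).mean()

# Usage sketch (data would come from sampled trajectories):
# loss = policy_gradient_loss(pi, states, actions, returns)
# optimizer.zero_grad(); loss.backward(); optimizer.step()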
2.2.2 Deterministic Policy Gradient (DPG) Theorem [20]

While the policy gradient theorem offers a feasible way to compute the gradient of a stochastic policy π_θ(a|s), the DPG theorem derives the gradient of a deterministic policy π_θ(s). The DPG theorem is the basis of policy learning in many modern DRL algorithms that aim to optimize a deterministic policy, such as DPG, DDPG, and TD3.
For clarity and simplicity, we define the related variables as follows:

• π_θ(s) = π(s): the deterministic policy parameterised by θ.
• V_{π_θ}(s) = V(s): the value function of state s under the given policy.
• Q_{π_θ}(s,a) = Q(s,a): the action-value function of the state-action pair (s,a).
• S: the state space.
• γ: the discount factor.
• P(s'|s,a): the probability of transitioning from state s to s' by taking action a; P(s → s', t, π_θ): the probability of transitioning from s to s' in t steps under policy π_θ.
• J(θ) = V(s_0): the objective function in an episodic MDP is the value function of the first state of an episode, following policy π.

We have:
\nabla_\theta V(s)
  = \nabla_\theta Q(s, \pi(s))     (V(s) = Q(s, \pi(s)) for a deterministic policy)
  = \nabla_\theta \big( r(s, \pi(s)) + \mathbb{E}[\gamma V(s')] \big)     (definition of Q(s,a))
  = \nabla_\theta r(s, \pi(s)) + \int_{\mathcal{S}} \big( \gamma \nabla_\theta P(s' \mid s, \pi(s)) V(s') + \gamma P(s' \mid s, \pi(s)) \nabla_\theta V(s') \big) ds'     (product rule)
  = \nabla_\theta \pi(s) \nabla_a r(s,a)\big|_{a=\pi(s)} + \int_{\mathcal{S}} \big( \gamma \nabla_\theta \pi(s) \nabla_a P(s' \mid s, a)\big|_{a=\pi(s)} V(s') + \gamma P(s' \mid s, \pi(s)) \nabla_\theta V(s') \big) ds'     (chain rule)

Noticing the common factor \nabla_\theta \pi(s) \nabla_a(\cdot)\big|_{a=\pi(s)} in the first two terms, we continue to derive the gradient of the value function:

\nabla_\theta V(s)
  = \nabla_\theta \pi(s) \nabla_a \Big( r(s,a) + \int_{\mathcal{S}} \gamma P(s' \mid s, a) V(s') \, ds' \Big)\Big|_{a=\pi(s)} + \int_{\mathcal{S}} \gamma P(s' \mid s, \pi(s)) \nabla_\theta V(s') \, ds'
  = \nabla_\theta \pi(s) \nabla_a Q(s,a)\big|_{a=\pi(s)} + \int_{\mathcal{S}} \gamma P(s' \mid s, \pi(s)) \nabla_\theta V(s') \, ds'     (definition of Q(s,a))

If we keep unrolling \nabla_\theta V(s') in the above equation, we can rewrite it in the following form:

\nabla_\theta V(s) = \int_{\mathcal{S}} \sum_{t=0}^{\infty} \gamma^t P(s \to s', t, \pi_\theta) \, \nabla_\theta \pi(s') \, \nabla_a Q(s', a)\big|_{a=\pi(s')} \, ds' \quad (2.10)

Plugging Eq. (2.10) into \nabla_\theta J(\pi_\theta), we have:

\nabla_\theta J(\pi_\theta)
  = \int_{\mathcal{S}} p_1(s) \nabla_\theta V(s) \, ds     (p_1(s), the initial state distribution, does not depend on \theta)
  = \int_{\mathcal{S}} p_1(s) \int_{\mathcal{S}} \sum_{t=0}^{\infty} \gamma^t P(s \to s', t, \pi_\theta) \, \nabla_\theta \pi(s') \, \nabla_a Q(s', a)\big|_{a=\pi(s')} \, ds' \, ds
  = \int_{\mathcal{S}} \rho^{\pi}(s) \, \nabla_\theta \pi(s) \, \nabla_a Q(s,a)\big|_{a=\pi(s)} \, ds
  = \mathbb{E}_{s \sim \rho^{\pi}}\big[ \nabla_\theta \pi(s) \, \nabla_a Q(s,a)\big|_{a=\pi(s)} \big]

where \rho^{\pi}(s) denotes the discounted state distribution induced by \pi.
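In practice, this deterministic policy gradient is implemented by simply backpropagating through a critic network; a minimal PyTorch sketch with hypothetical actor and critic modules follows.

import torch

def dpg_actor_loss(actor, critic, states):
    # Maximizing E_s[ Q(s, pi_theta(s)) ] follows the deterministic policy gradient:
    # autograd applies the chain rule grad_theta pi(s) * grad_a Q(s,a)|_{a=pi(s)} for us.
    actions = actor(states)
    return -critic(states, actions).mean()

# actor_optimizer.zero_grad()
# dpg_actor_loss(actor, critic, batch_states).backward()
# actor_optimizer.step()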
2.2.3 Actor-Critic architecture

In model-free DRL, we can choose to improve the agent's policy by directly optimizing a parameterized policy or by learning a value function [12], which is used to estimate action values in given states (i.e., from which a policy can be inferred). However, it can be useful to learn both a value function Q_φ and a policy function π_θ simultaneously. The policy function π_θ, or actor, generates an action a from a given state s, then executes a to obtain the next state s', a reward value r, and a termination indicator d ∈ {0, 1}. A tuple (s, a, r, s', d) is called an experience, similar to a data point, and is used to compute gradient updates for the value function, or critic. The critic considers the state and the action chosen by the actor, and then computes a Q-value that is used to update the actor's parameter vector θ. This entire process can be summarised by the following equations:

\theta \leftarrow \theta + \alpha_\theta \, Q_\phi(s,a) \, \nabla_\theta \ln \pi_\theta(a \mid s) \quad (2.11)

\delta_t = r_t + \gamma Q_\phi(s', a') - Q_\phi(s, a) \quad (2.12)

\phi \leftarrow \phi + \alpha_\phi \, \delta_t \, \nabla_\phi Q_\phi(s, a) \quad (2.13)
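A minimal sketch of one actor-critic update in the spirit of Eqs. (2.11)-(2.13); the module and optimizer names are hypothetical, rewards and done flags are assumed to be 1-D tensors, and bootstrapping with a' sampled from the current actor is an illustrative choice rather than the exact scheme of any specific algorithm.

import torch

def actor_critic_step(actor, critic, batch, actor_opt, critic_opt, gamma=0.99):
    # `actor(s)` returns a torch.distributions object, `critic(s, a)` a Q-value of shape (B, 1).
    s, a, r, s_next, done = batch

    # Critic update (Eqs. 2.12-2.13): TD error  delta = r + gamma * Q(s', a') - Q(s, a).
    with torch.no_grad():
        a_next = actor(s_next).sample()
        target = r + gamma * (1.0 - done) * critic(s_next, a_next).squeeze(-1)
    critic_loss = (critic(s, a).squeeze(-1) - target).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update (Eq. 2.11): ascend Q(s, a) * grad_theta log pi_theta(a|s).
    log_prob = actor(s).log_prob(a).sum(dim=-1)
    actor_loss = -(critic(s, a).squeeze(-1).detach() * log_prob).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()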
The Actor-Critic architecture has several attractive properties [19]:

• Accuracy and scalability: Actor-Critic methods represent their policy function π_θ and value function Q_φ as non-linear deep neural networks. As a result, DRL offers an approach to tackle problems with extremely large state spaces and action spaces. This is considered one of the main strengths of current DRL algorithms compared to traditional methods in robotic control.
• Stability: most algorithms that implement the actor-critic architecture make use of one or more target critics alongside the main critic(s). A target critic serves as a lagged version of the main critic and gets updated less frequently in order to stabilize the algorithm (a minimal sketch of such a lagged update is given below). As for why a target critic can stabilize a DRL algorithm, we refer to the work of Lillicrap et al.

FIGURE 2.2: Actor-Critic architecture.
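The 'lagged' target critic mentioned above is typically maintained by Polyak averaging of the parameters; a minimal sketch follows (the coefficient 0.005 is a common default, not necessarily the value used in our experiments).

import torch

@torch.no_grad()
def polyak_update(target_net, main_net, tau=0.005):
    # phi_target <- (1 - tau) * phi_target + tau * phi_main: the target slowly tracks the main network.
    for p_target, p_main in zip(target_net.parameters(), main_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p_main)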
2.2.4 On-policy algorithms vs off-policy algorithms

Current algorithms in model-free reinforcement learning are divided into on-policy algorithms and off-policy algorithms. The difference between the two classes of algorithms is characterised by the way they process the collected data.
On-policy algorithms. Algorithms in this family rely on Monte Carlo sampling, which collects a large number of samples, to evaluate the current policy. Optimizing the policy network is then a common regression problem: given the samples (where state-action pairs are examples and reward values are labels) in mini-batches, find the optimal network parameters that maximize the expected reward function. Afterwards, those samples are discarded because they are no longer related to the newly-updated policy.

Off-policy algorithms. Algorithms in this family assume that any sample generated by any policy function (i.e., a past policy function) can be used to optimize the current policy function. As a result, a data storage structure, known as a replay buffer, is adopted to store the samples that have already been used in optimizing the policy function. Mini-batches of samples from the replay buffer are then re-used to update the current policy function. Thanks to the use of the replay buffer, off-policy methods often achieve higher sample efficiency than on-policy methods.
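A minimal replay buffer sketch for storing and re-sampling experiences (s, a, r, s', d); a fixed-capacity deque with uniform sampling is one simple choice, not necessarily the exact data structure used by the implementations discussed in this thesis.

import random
from collections import deque

class ReplayBuffer:
    # Stores experiences (s, a, r, s_next, done) and samples uniform mini-batches.
    def __init__(self, capacity=1_000_000):
        self.storage = deque(maxlen=capacity)    # oldest experiences are evicted first

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size=256):
        batch = random.sample(self.storage, batch_size)
        # Transpose the list of tuples into (states, actions, rewards, next_states, dones).
        return tuple(map(list, zip(*batch)))

    def __len__(self):
        return len(self.storage)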
The powerful properties of the actor-critic architecture combined with the sample-reuse mechanism of off-policy methods are the main ingredients of current state-of-the-art DRL algorithms in the continuous control domain.

In our thesis, we experiment with TD3 and SAC, which are two popular off-policy algorithms that use the actor-critic architecture. The main difference between the two algorithms is the nature of their actor: TD3 makes use of a deterministic actor π_θ(s), while SAC prefers a stochastic actor π_θ(a|s). As the actor is the equivalent of the policy, the distinction between the two types is given in Section 2.1.1.
2.2.5 Twin Delayed Deep Deterministic Policy Gradient (TD3)
TD3 [5] is a model-free off-policy DRL algorithm in which an actor (or policy network/controller) and two critics (or value functions) are learned at the same time. Every time the agent makes a decision in the environment, the experience is stored in an off-policy replay buffer. These data are then sampled into mini-batches and used to update the critics with a modified Bellman equation:

y(r, s', d) = r + \gamma (1 - d) \min_{i=1,2} Q_{\phi_i'}(s', a')

a' = \mathrm{clip}\big( \pi_{\theta'}(s') + \epsilon', \, a_{\min}, \, a_{\max} \big)

\epsilon' \sim \mathrm{clip}\big( \mathcal{N}(0, \sigma'), \, -c, \, c \big)

where the action a' is sampled from the target actor network π_{θ'}(s') with some target policy noise ε'. The clip operations ensure that actions and noises stay within the allowable ranges [a_min, a_max] and [−c, c], respectively. The parameter vector θ' of the target actor π_{θ'} is updated based on θ after every certain number of time steps. The critics are used to update the actor using the DPG theorem equation:

\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s}\big[ \nabla_\theta \pi(s) \, \nabla_a Q(s,a)\big|_{a=\pi(s)} \big] \quad (2.14)
                            = \mathbb{E}_{s}\big[ \nabla_\theta Q(s, \pi_\theta(s)) \big] \quad (2.15)

Algorithm 1 summarizes the pseudo-code of TD3, adapted from [1].
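The target computation above can be written compactly in PyTorch; the following sketch uses hypothetical target-network modules, and the noise and discount values are commonly used defaults rather than necessarily those of our experiments. The tensors r and done are assumed to have the same shape as the critic output.

import torch

@torch.no_grad()
def td3_target(r, s_next, done, actor_target, q1_target, q2_target,
               gamma=0.99, sigma=0.2, noise_clip=0.5, max_action=1.0):
    # Target policy smoothing: a' = clip(pi_theta'(s') + clip(eps, -c, c), a_min, a_max).
    a_det = actor_target(s_next)
    noise = (torch.randn_like(a_det) * sigma).clamp(-noise_clip, noise_clip)
    a_next = (a_det + noise).clamp(-max_action, max_action)
    # Clipped double-Q: bootstrap from the minimum of the two target critics.
    q_next = torch.min(q1_target(s_next, a_next), q2_target(s_next, a_next))
    return r + gamma * (1.0 - done) * q_next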
2.2.6 Soft Actor-Critic

Policy gradient algorithms can be implemented in an on-policy setting (i.e., experiences sampled under policy π_θ are discarded after being used to approximate ∇_θ J(π_θ)) [18], [17], which is sample-inefficient but offers higher stability, or in an off-policy setting (i.e., experiences sampled under different policies can be stored in a buffer and later used to estimate ∇_θ J(π_θ)) [24], [5], which has a higher sample efficiency but is sensitive to many hyperparameters [7]. Soft Actor-Critic (SAC) [6] is an off-policy algorithm that is built on the actor-critic framework. SAC introduces an entropy regularization term into the RL objective function, and the policy should learn to maximize the following function:

J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[ r(s_t, a_t) + \alpha \, \mathcal{H}(\pi(\cdot \mid s_t)) \big]

where α is a temperature coefficient that controls the relative importance of the entropy term H(π(·|s_t)) against the reward.
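To illustrate how the entropy term enters the update in practice, below is a minimal sketch of a SAC-style actor loss; it uses the log-probability of a reparameterized sample as the entropy estimate and a single critic for brevity, which is a simplification of the full SAC algorithm.

import torch

def sac_actor_loss(actor, critic, states, alpha=0.2):
    # `actor(states)` is assumed to return a reparameterizable torch.distributions object.
    dist = actor(states)
    actions = dist.rsample()                         # reparameterization trick
    log_prob = dist.log_prob(actions).sum(dim=-1)
    # Maximize E[ Q(s, a) - alpha * log pi(a|s) ] (reward plus an entropy bonus),
    # i.e. minimize its negation.
    return (alpha * log_prob - critic(states, actions).squeeze(-1)).mean()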
Algorithm 1 Twin Delayed DDPG (TD3)
 1: Input: θ, φ_1, φ_2, replay buffer D,
 2:        max simulation steps H, the number of updates per step U,
 3:        actor update frequency freq
 4: Initialize: π_θ, Q_{φ_1}, Q_{φ_2}, θ' ← θ, φ'_{i=1,2} ← φ_{i=1,2}
 5: for t = 1 to H do
 6:     Execute action a_t ~ clip(π_θ(s_t) + ε, a_min, a_max),
 7:         where ε ~ N(0, σ) is some exploration noise.
 8:     Observe next state s_{t+1} and reward r_t.
 9:     for j = 0 to U − 1 do
10:         Sample a batch of data B from D and compute y(r, s', d).
11:         Run 1 step of gradient descent on the critics:
                ∇_{φ_i} (1/|B|) Σ_{(s,a,r,s',d) ∈ B} ( Q_{φ_i}(s,a) − y(r,s',d) )²   for i = 1, 2
12:         if j mod freq = 0 then
13:             Run 1 step of gradient ascent on the actor