
Scientific research report: Studying Deep Reinforcement Learning and Application for Trading on the Vietnamese Stock Market


DOCUMENT INFORMATION

Basic information

Title: Studying Deep Reinforcement Learning and Application for Trading on Vietnamese stock market
Authors: Doãn Văn An, Phan Gia Bảo, Lê Thúy Huyền
Institution: Vietnam National University, Hanoi
Major: Business Data Analytics
Document type: Student Research Report
Year: 2024
City: Hanoi
Pages: 37
Size: 1.6 MB

Structure

  • 1. Research Topic
  • 2. Student's Information
  • I. INTRODUCTION
  • II. LITERATURE REVIEW
  • III. METHODOLOGY
    • A. Markov Decision Process
    • B. Experience replay
    • C. Deep Reinforcement Learning Algorithms
    • D. Agent of trading bot
    • E. Getting state
    • F. Action strategies and reward function
    • H. Neural Network architecture
    • J. Implementation tools
  • IV. RESULTS & DISCUSSION
  • V. LIMITATION AND RECOMMENDATION
  • VI. ABBREVIATION (in alphabetical order)
  • VII. REFERENCES

Content

VIETNAM NATIONAL UNIVERSITY, HANOI
INTERNATIONAL SCHOOL
STUDENT RESEARCH REPORT
Studying Deep Reinforcement Learning and Application for Trading on Vietnamese stock market

Research Topic

English: Studying Deep Reinforcement Learning and Application for Trading on Vietnamese stock market

Vietnamese: Nghiên cứu học tăng cường và ứng dụng cho giao dịch trên thị trường chứng khoán Việt Nam.

Student’s Information

Name            Student ID   Class      Program                   Year
Doãn Văn An     21070236     BDA2021B   Business Data Analytics   3rd
Phan Gia Bảo    21070849     BDA2021B   Business Data Analytics   3rd
Lê Thúy Huyền   21070410     BDA2021C   Business Data Analytics   3rd

The stock market has long been an attractive investment channel for investors. However, in the Vietnamese market, stock investing is still quite new to most people.

At present, as other investment channels such as real estate and bank savings gradually lose their appeal and returns for investors decline, the stock market has emerged and attracted attention across social networks and social media as an alternative for investors. However, because many investors are participating in this risky market for the first time, they are still unsure how to get started. Realizing this, we came up with the idea of building a trading bot model using Deep Reinforcement Learning.

Deep Reinforcement Learning algorithms, including Vanilla DQN, Fixed-Target DQN, and Double DQN, empower trading bots to automate investment decisions. These bots eliminate the need for manual intervention, allowing investors to delegate capital to the algorithm. The algorithm leverages market data to develop optimal trading strategies that maximize returns.

INTRODUCTION

Stock trading has a long history dating back to around the 15th century [1], when merchants in bustling Western commercial cities often gathered at coffee shops to negotiate the purchase and sale of a variety of goods such as agricultural products, assets, minerals, foreign currencies, and real estate stocks. Along with the development and recovery of the economy after the COVID-19 pandemic, stock trading activity has increased and attracted growing attention. It is an investment channel that can bring very large profits, but it is also full of risks. Therefore, investors increasingly need to find the most suitable and effective investment path, and Artificial Intelligence (AI) is gradually being applied to stock trading.

In recent years, interest in Artificial Intelligence (AI) has grown rapidly, with many research papers published every year. AI is emerging as an effective method in stock trading, helping investors optimize profits and risks. Technological aspects of AI such as machine learning, reinforcement learning, deep learning, and natural language processing are gradually being applied to risk detection, risk assessment, transactions, and 24/7 customer service. The application of Reinforcement Learning (RL) to financial market trading has attracted significant attention and exploration in recent years. In the field of algorithmic trading, RL techniques, especially Deep Q Learning, have emerged as a promising avenue for creating intelligent trading systems [2]. The combination of RL methods with stock market dynamics represents a frontier where machine learning intersects with finance, seeking to leverage data-driven strategies for decision-making in trading [3].

Financial trading involves managing capital, creating value, and understanding markets, risks, and investment strategies. In Vietnam, as the stock market gains prominence amidst a stagnant real estate market and low bank interest rates, investors are exploring this investment channel, highlighting the need for a comprehensive understanding of its complexities.

Therefore, we conducted this research on applying Deep Q Learning to stock trading on the Vietnamese stock market. Specifically, we build a trading bot that, starting from an initial investment amount, makes automatic trading decisions on a single stock code with the aim of maximizing profit. Based on the trading bot's reported profit, even amateur investors can invest in stocks and securities instead of leaving their cash idle, while feeling more secure about their investment.

LITERATURE REVIEW

Regarding the application of Reinforcement Learning (RL) algorithms to trading strategies for continuous futures contracts, the article "Deep Reinforcement Learning for Trading" by Zihao Zhang, Stefan Zohren, and Stephen Roberts at the Department of Engineering Science and the Oxford-Man Institute of Quantitative Finance, University of Oxford [4] delves into Deep Reinforcement Learning (DRL) algorithms applied to the development of trading strategies for continuous futures contracts. The study emphasizes that RL algorithms can outperform traditional models and still yield profits under significant trading costs. The researchers explore both discrete and continuous action spaces, incorporating volatility scaling into the reward functions to adjust trade positions based on market volatility levels. They present cumulative trade returns over multiple years on a dataset comprising the 50 most liquid futures contracts from 2011 to 2019 for each asset class and strategy, comparing DRL algorithms such as Deep Q-learning Networks (DQN), Policy Gradients (PG), and Advantage Actor-Critic (A2C) with traditional strategies (Long, Sign(R), MACD) across various asset classes.

Figure 1: Performance metrics of different trading strategies

The study's findings highlight the superiority of DRL algorithms, particularly Double Deep Q-Network (DQN) and Advantage Actor-Critic (A2C), in financial trading. These algorithms effectively generate substantial profits and ensure positive returns, even under challenging conditions involving high transaction costs. The comparison of Sharpe ratios underscores the robust performance of DRL algorithms relative to baseline models and traditional trading approaches.

DRL algorithms such as DQN and A2C demonstrate superior risk-adjusted returns, effectively managing risks and generating profits. Their robustness is evident in capturing long-term trends, shedding light on their adaptability to diverse asset classes. DRL techniques have the potential to outperform traditional strategies in real-world applications, particularly in managing transaction costs. LSTM neural networks, commonly used in natural language processing, are employed to model the agent and critic networks in the DRL algorithms.

Another study, "Beating the Stock Market with a Deep Reinforcement Learning Day Trading System" by Leonardo Conegundes and Adriano C. M. Pereira [5], delves into the application of DRL to day trading on the Brazilian Stock Exchange (B3). The primary objective of the study is to harness DRL algorithms to optimize asset allocation strategies by executing buy and sell transactions exclusively within a single trading day. By training the DRL system in simulated environments and considering critical factors such as liquidity, slippage, and transaction costs, the research aims to surpass traditional trading methodologies. The proposed trading system uses a Deep Deterministic Policy Gradient (DDPG) algorithm to solve a series of asset allocation problems; the algorithm defines the percentage of capital that must be invested in each asset at each period. DDPG is a model-free, off-policy actor-critic method that can learn policies in high-dimensional, continuous action and state spaces, such as those typically found in financial market environments. The algorithm is based on the Deterministic Policy Gradient (DPG) method, combining the actor-critic approach with insights from the success of Deep Q Networks (DQN). DDPG combines the value-based DQN and the policy-based DPG for large continuous domains, where the actor learns using the Bellman equation based on feedback from the critic. Through a rigorous evaluation process, the research team compares the performance of the proposed DRL-based day trading system against a range of benchmarks, including the Ibovespa index and top-performing stock portfolios recommended by prominent Brazilian financial institutions. The empirical results obtained from testing the system over the period from 2017 to 2019 reveal that the DRL algorithm not only generated alpha but also significantly outperformed the benchmark portfolios considered in the study.

Figure 2: Annual returns from 2017 to 2019 of the 3 DRL methods and 10 benchmarks

The study underscores several key advantages of employing DRL in this context, including end-to-end optimization, the intelligent creation of trading systems through direct policy learning, and the capacity to train in simulated environments that account for crucial market dynamics such as liquidity, latency, slippage, and transaction costs. By leveraging DRL to address complex decision-making challenges characterized by high-dimensional state-action spaces, particularly prevalent in day trading scenarios, the research underscores the transformative potential of DRL for enhancing trading strategies and generating alpha in financial markets. However, the study still has some limitations. One significant limitation lies in the reliance on historical market data: the effectiveness of the DRL-based day trading system is contingent upon the quality and availability of such data, and inadequate or biased historical data could hinder the system's performance in real-world trading. Moreover, the study operates under idealized market assumptions, including zero slippage, zero market impact, and sufficient liquidity; real market conditions are dynamic and may not always align with these assumptions, potentially affecting the system's performance. The study's restriction to a limited set of assets for trading, comprising the ten most traded stocks from the previous year, may not fully represent the diversity of assets available in the market. Effective risk management strategies are also essential considerations that warrant further exploration within the DRL-based trading system to mitigate potential losses and ensure long-term viability.

METHODOLOGY

Based on the idea of applying Deep Reinforcement Learning to the stock trading problem (Adaptive Stock Trading Strategies with Deep Reinforcement Learning Methods), we carefully reviewed the components of previous research on input data, neural network architecture, action strategies, and model algorithms in order to adapt them and find new directions for our trading bot. We consider changes to the input state format of the neural network, implementation strategies that suit practice, changes to the neural network structure, and testing on several Deep Q Network algorithms.

Markov Decision Process

In a Reinforcement Learning problem, an agent needs to choose the optimal action based on its current state. This process, when repeated, is modeled as a Markov Decision Process (MDP). An MDP is a framework that helps an agent make decisions in a given state; in fact, MDPs have become the de facto standard formalism for learning sequential decision-making [6]. The components of an MDP include the set of states S, the set of actions a, the reward function R(s, a), and the transition model and policy of the MDP. At any time step t, the goal of RL is to maximise the expected return [4]:

$$\mathbb{E}\left[\sum_{i=0}^{\infty} \gamma^{i}\, r_{t+i}\right],$$

where the discount factor (γ) determines the importance of future rewards in the agent's decision-making process. A value of 1 indicates that the agent considers future rewards as important as immediate rewards, whereas a value closer to 0 makes the agent consider mainly immediate rewards.
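As a quick numerical illustration of the role of γ, a short Python sketch (the reward sequence is made up for the example):

```python
def discounted_return(rewards, gamma=0.95):
    """Sum of gamma^i * r_{t+i} over a finite reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Hypothetical reward sequence: with gamma = 0.95 distant rewards still matter,
# with gamma = 0.1 the agent is effectively myopic.
rewards = [1.0, 0.0, 2.0, -1.0]
print(discounted_return(rewards, gamma=0.95))  # ~1.95
print(discounted_return(rewards, gamma=0.1))   # ~1.02
```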

Experience replay

In the Deep Q Learning algorithm, we use a technique called experience replay during training. In an episode, which represents a complete run-through of the training process, the agent starts from an initial state, takes actions by choosing between exploitation and exploration based on epsilon, observes rewards, and learns from its experiences. Epsilon (ε) represents the exploration-exploitation trade-off in reinforcement learning. We save the agent's experiences at each moment in a data set called replay memory. At time t, the agent's experience is defined as a tuple (s_t, a_t, r_{t+1}, s_{t+1}), where:

Figure 4: Sample of replay memory

• s_t is the current state of the environment
• a_t is the action taken in that state
• r_{t+1} is the reward received after taking the action
• s_{t+1} is the next state of the environment

This data set is sampled randomly, instead of selecting samples sequentially, before being fed into the neural network for training, which helps avoid correlations between consecutive samples. In this way, the training results of the neural network are improved.
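A minimal sketch of such a replay memory in Python (the capacity and batch size are illustrative, not the report's exact settings):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (s_t, a_t, r_{t+1}, s_{t+1}, done) tuples."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are discarded automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random sampling breaks the temporal correlation between consecutive samples.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```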

Deep Reinforcement Learning Algorithms

In Q Learning, the input is S_t (the state S at time t) and a_t (the action a at time t). After the exploration and exploitation process, all possible combinations of states and actions are gathered and we obtain a Q table in which the rows are states and the columns are actions. From there, we can take the action for each state that yields the optimal Q value, i.e. determine the optimal policy. However, if the number of combinations of states and actions is too large, the memory and time needed to compute the Q table increase sharply, especially with effectively unbounded data such as the stock price time series in our topic. Therefore, the Deep Q Network (DQN) serves as an alternative to conventional Q learning: DQN approximates the Q value function, which estimates how good it is for an agent to perform a certain action in a certain state, by applying a neural network called the policy network (θ). This helps reduce memory use and computation time compared to maintaining a full Q table.

For each given state input, the network outputs estimated Q values for each action that can be performed from that state. The main goal of this network is to estimate the optimal Q function, which must satisfy the Bellman equation [4]:

$$Q^{*}(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q^{*}(s', a')\right],$$

where:

• Q*(s, a) is the optimal action-value function
• max_{a'} Q*(s', a') is the maximum expected return that can be achieved from any possible next state-action pair

The network is trained to minimize the difference between the predicted Q-value and the target Q-value, giving the objective function [4]:

$$L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a';\theta) - Q(s, a;\theta)\right)^{2}\right].$$

The Vanilla DQN algorithm is built as follows (a minimal training-step sketch is given after the list):

1. Initialize the neural network (policy network) with random weights
2. Observe the reward received and the next state from the selected action
3. Store experiences in replay memory
4. Sample a random batch from replay memory
5. Pass the batch of preprocessed states to the policy network
6. Calculate the loss between the output Q-values and the Q-target
   + Requires a second pass through the policy network for the next state
7. Gradient descent updates the weights of the policy network to minimize the loss
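To make steps 4-7 concrete, here is a minimal single training step in Python with TensorFlow/Keras, assuming `policy_net` is a Keras model and the batch is given as NumPy arrays; the names and values are illustrative, not the report's exact code:

```python
import numpy as np
import tensorflow as tf

def vanilla_dqn_step(policy_net, optimizer, batch, gamma=0.95):
    """One Vanilla DQN update: the same policy network produces both the
    predicted Q-values and the bootstrapped Bellman targets."""
    states, actions, rewards, next_states, dones = batch   # NumPy arrays

    # Target: r + gamma * max_a' Q(s', a'; theta); the future term is zeroed at episode end.
    next_q = policy_net(next_states).numpy()               # second pass, same network
    targets = (rewards + gamma * next_q.max(axis=1) * (1.0 - dones)).astype(np.float32)

    with tf.GradientTape() as tape:
        q_values = policy_net(states)                                  # Q(s, .; theta)
        idx = tf.stack([tf.range(tf.shape(q_values)[0]),
                        tf.cast(actions, tf.int32)], axis=1)
        q_taken = tf.gather_nd(q_values, idx)                          # Q(s, a; theta)
        loss = tf.keras.losses.Huber()(targets, q_taken)               # Huber loss, as used in the report

    grads = tape.gradient(loss, policy_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_net.trainable_variables))
    return float(loss)
```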

Figure 5: Diagram of Vanilla DQN

Deep Q Network with fixed Q-target

In Vanilla DQN, the same network produces both the prediction and the target, so the target shifts after every update, which can make training unstable. The DQN with fixed Q-target algorithm offers a solution to this situation: build a second neural network with the same structure as the original (policy) network, known as the target network (θ'), and use it to calculate a separate target Q value. The weights of this target network are updated only at certain intervals. In this way, we obtain the Q-value for the next state from the target network and plug it back into the Bellman equation to calculate the target Q-value for the current state.

The DQN with fixed Q-target algorithm is built as follows (a short weight-synchronization sketch follows the list):

1. Initialize the neural network (policy network) with random weights
2. Observe the reward received and the next state from the selected action
3. Store experiences in replay memory
4. Sample a random batch from replay memory
5. Pass the batch of preprocessed states to the policy network
6. Calculate the loss between the output Q-values and the Q-target
   + Requires a pass through the target network for the next state
7. Gradient descent updates the weights of the policy network to minimize the loss
   + After every x time steps, the target network's weights are updated to match the policy network
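A short sketch of the periodic weight synchronization described in the last step (the sync interval is an arbitrary placeholder):

```python
SYNC_EVERY = 100  # illustrative reset frequency, in training steps

def maybe_sync_target(step, policy_net, target_net, sync_every=SYNC_EVERY):
    """After every `sync_every` steps, refresh the frozen target network
    with the current policy-network weights (Keras API)."""
    if step % sync_every == 0:
        target_net.set_weights(policy_net.get_weights())

# The Bellman target is then computed with the frozen network:
#   targets = rewards + gamma * target_net(next_states).numpy().max(axis=1) * (1 - dones)
```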

Figure 6: Diagram of DQN with fixed Q-target

Double Deep Q Network

Both DQN algorithms above share a drawback: overestimation of the Q-value. After training, the model selects the action with the highest Q-value; however, the highest Q-value does not necessarily bring high efficiency and can even hurt the results and performance of the model. Double DQN is an algorithm that does not simply select the highest Q-value but separates action selection from action evaluation. The model is still composed of two neural networks, the policy network and the target network, but the target Q no longer depends on the largest Q value in the target network; instead, it depends on the action that has the highest Q value in the policy network [8]:

$$y_t = r_{t+1} + \gamma\, Q\!\left(s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a';\theta);\, \theta'\right).$$

This prevents the model from overestimating the Q value, increasing the stability of the model. The Double DQN algorithm is built as follows (see the target-computation sketch after the list):

1. Initialize the neural network (policy network) with random weights
2. Observe the reward received and the next state from the selected action
3. Store experiences in replay memory
4. Sample a random batch from replay memory
5. Pass the batch of preprocessed states to the policy network
6. Calculate the loss between the output Q-values and the Q-target
   + Requires a second pass through the policy network for the next state
   + Choose the action with the highest Q-value for the next state in the policy network
   + Requires a pass through the target network for the next state
   + Choose the Q-value in the target network based on the chosen action
7. Gradient descent updates the weights of the policy network to minimize the loss
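The key difference, action selection by the policy network and evaluation by the target network, can be sketched as follows (assuming Keras models and NumPy batch arrays; names are illustrative):

```python
import numpy as np

def double_dqn_targets(policy_net, target_net, rewards, next_states, dones, gamma=0.95):
    """Double DQN target: the policy network selects the next action,
    the target network evaluates it."""
    next_q_policy = policy_net(next_states).numpy()      # used only to pick the action
    best_actions = next_q_policy.argmax(axis=1)

    next_q_target = target_net(next_states).numpy()      # used only to value that action
    chosen_q = next_q_target[np.arange(len(best_actions)), best_actions]

    return (rewards + gamma * chosen_q * (1.0 - dones)).astype(np.float32)
```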

Figure 7: Diagram of Double DQN

Agent of trading bot

The Agent class contains a deep reinforcement learning model, which is used to estimate the best action for a given state. Several properties and methods are defined in this class to manage stock trading and model training. The attributes and model parameter values in this Agent object are selected and adjusted so that the output quality is stable and optimal.

Some typical attributes that must be present in a trading bot applying Deep Reinforcement Learning:

• State_size: the size of the input state of the environment, which helps the trading bot decide its next action
• Action_size: the number of stock trading actions that the trading bot can perform
• Model name: the name of the deep reinforcement learning model used for trading
• Inventory: the list of assets (stocks) that the agent is holding
• Memory: a repository that stores the model's states, actions, rewards, and next states for use in training

In addition to the properties of the agent, we also set up the hyperparameters, loss function, and optimization algorithm of the model, helping the model run stably and optimally:

A gamma (γ) of 0.95 indicates that the agent has a strong preference for long-term rewards, meaning it values future rewards almost as much as immediate ones.

At the beginning of training, epsilon is set high (1.0) to encourage exploration. As training progresses, it gradually decays towards epsilon_min.

Epsilon_min (ε_min) is set so that the agent continues to explore with a minimum probability of 1% even after it has gained substantial experience. Once epsilon reaches this value, the agent's policy becomes increasingly deterministic as it relies less on exploration and more on exploitation of learned knowledge.

Epsilon_decay (ε_decay) of 0.995 indicates that epsilon decreases relatively slowly over time. Specifically, during training the exploration-exploitation trade-off gradually shifts towards exploitation (taking actions based on the learned policy) at a moderate pace: epsilon decreases by 0.5% after each training iteration (episode), allowing the agent to explore less as training progresses. However, the agent still retains some level of exploration throughout training, ensuring that it continues to explore new possibilities and refine its policy.

The learning rate (α) is relatively small, reflecting a conservative approach to parameter updates that prioritizes stability and smooth convergence during training. This means the updates to the parameters are conservative; however, it may also mean that training takes longer to converge to an optimal solution compared to higher learning rates. Choosing an appropriate learning rate is crucial when training neural networks.

Since we use personal computers and the training process consumes a lot of computing resources, we keep the number of episodes to 50, although setting it higher may yield better results.

The optimization algorithm, in this case the Adam optimizer, employs a combination of Momentum and RMSprop to effectively adjust the model's weights during neural network training. These parameters collectively shape the learning process and adaptive strategy of the trading bot as it interacts with the trading environment, and optimizing them can substantially impact the bot's performance and training trajectory.

DQN with fixed Q-target and Double DQN introduce two additional parameters: the number of interactions and the reset frequency of the target network. Because these algorithms are made up of two separate neural networks, these parameters influence their stability and performance.
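Putting the attributes and hyper-parameters above together, a hedged sketch of the Agent class might look like this (the placeholder network, learning rate, and memory size are illustrative assumptions, not the report's exact values):

```python
import random
from collections import deque

import numpy as np
import tensorflow as tf

class Agent:
    """Sketch of the trading agent's attributes and hyper-parameters."""

    def __init__(self, state_size=11, action_size=5, model_name="vanilla_dqn"):
        self.state_size = state_size        # 10 price features + max-buy coefficient
        self.action_size = action_size      # hold, sell, buy 300 / 600 / 900 shares
        self.model_name = model_name
        self.inventory = []                 # (price, volume) pairs currently held
        self.memory = deque(maxlen=10_000)  # replay memory (size is an assumption)

        self.gamma = 0.95                   # strong preference for long-term rewards
        self.epsilon = 1.0                  # start fully exploratory
        self.epsilon_min = 0.01             # never explore less than 1% of the time
        self.epsilon_decay = 0.995          # epsilon shrinks by 0.5% per episode

        self.optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)  # rate is illustrative
        self.loss_fn = tf.keras.losses.Huber()

        # Placeholder network; the Conv1D architecture is sketched further below.
        self.model = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(state_size,)),
            tf.keras.layers.Dense(32, activation="relu"),
            tf.keras.layers.Dense(action_size),
        ])

    def act(self, state):
        """Epsilon-greedy action selection."""
        if random.random() < self.epsilon:
            return random.randrange(self.action_size)
        q_values = self.model(state[np.newaxis, :]).numpy()
        return int(q_values.argmax())

    def decay_epsilon(self):
        """Shift gradually from exploration to exploitation."""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
```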

Getting state

To feed data into the agent, we must convert the original time series data into a state representation that can be passed to the model. The state is the second important part of DQN after the reward function: it is the input to the neural network, or, put differently, it lets the bot know where it is. We define states that reflect the state of the financial market at a particular time based on historical prices, plus a multiplier that represents the number of shares that can still be bought with the remaining capital. Below are the detailed steps by which the agent builds its states.

First, we take the data from the dataset and convert it into a 1-dimensional array of 11 features: the current price and the prices of the previous 10 days.

Figure 8: Extract data from dataset

At the start of the series there are not yet 10 previous data points; if there is not enough data, the missing values are filled in with the first data point of the data set.

Figure 9: Extract data if not enough data points

We then apply the sigmoid function to the calculated price differences, limiting the output to the range 0 to 1. This makes the fluctuations visible while reducing the number of distinct states the bot can encounter; too many states would confuse the bot, leaving it unable to decide what action to take [9]. Each of the 10 features is computed as

$$x_t = \sigma\!\left(\frac{P_t - P_{t-1}}{300}\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},$$

where:

• P_t is the stock price at time t

The reason we divide by 300 is that the price fluctuation between two consecutive days is usually between -1000 and 1000 VND, sometimes more. If we did not divide by 300, the value of the sigmoid function would effectively only ever be 0, 0.5, or 1. When the volatility is 1000 or more, the bot evaluates it as highly volatile, with the value of our expression increasing slowly from about 0.97 towards 1. We tried 300 and found it to be a reasonable number for the Vietnamese market; in other problems, the divisor can be adjusted to suit. This is how we convert volatility and trend into a vector with 10 features for the bot to recognize.

Figure 11: Convert volatility to vector

Then we add an action multiplier so that the bot knows its remaining capital:

• MBA_t is the maximum-buy-action coefficient at time t

The result is 0, 0.1, 0.2, or 0.3. As stated above, the bot can only choose to buy 300, 600, or 900 shares: 0 corresponds to a next state in which no shares can be bought, 0.1 to a next state in which up to 300 shares can be bought, 0.2 to a maximum of 600 shares, and 0.3 to a maximum of 900 shares. The output of the get state function is a vector with 11 features.
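A sketch of the state construction described above; the padding logic and the mapping from remaining capital to the max-buy coefficient are our reading of the text, not the report's exact code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def get_state(prices, t, remaining_capital, window=10, scale=300.0):
    """Build the 11-feature state: 10 sigmoid-scaled daily price changes
    plus a max-buy coefficient in {0.0, 0.1, 0.2, 0.3}."""
    # Take the current price and the previous `window` prices,
    # padding with the first available price when history is too short.
    start = t - window
    if start >= 0:
        block = list(prices[start:t + 1])
    else:
        block = [prices[0]] * (-start) + list(prices[0:t + 1])
    block = np.asarray(block, dtype=float)            # window + 1 = 11 prices

    # 10 features: sigmoid of consecutive differences divided by 300,
    # so typical daily moves spread across (0, 1) instead of saturating.
    diffs = sigmoid((block[1:] - block[:-1]) / scale)

    # Max-buy coefficient: how many 300-share lots the remaining capital covers, capped at 3.
    lots_affordable = int(remaining_capital // (prices[t] * 300))
    mba = min(lots_affordable, 3) / 10.0              # assumed mapping to 0.0-0.3

    return np.append(diffs, mba)                      # 11-feature state vector
```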

Action strategies and reward function

The trading bot we build provides 5 actions: sell, buy 300, 600, or 900 shares, and hold. We set the default initial capital to 100,000,000 VND, and the bot can only trade once a day, at the open price. Training on personal computers forces us to limit the number of actions, because too many action types would confuse the agent during training. After purchasing shares, the quantity and purchase price are saved in the agent's inventory. When the bot decides to sell, it sells all the shares it holds.

The reward function is pivotal in the DQN algorithm, serving as the agent's feedback mechanism. By penalizing actions that lead to losses and rewarding those resulting in profits, it guides the agent's behavior in similar future scenarios. Through experimentation and analysis, we developed a reward function formula that supports effective reinforcement learning.

• BV_i is the volume corresponding to each purchase price

Total profit is calculated using the formula:

Below is a diagram of the trading bot's operating sequence in stock trading:

Figure 13: Operation flowchart of trading bot
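Below is a hedged sketch of how the five actions and the profit-based reward described above could be executed in one trading step; the exact reward formula of the report is not reproduced here, so the profit-as-reward mapping is an assumption:

```python
def step(action, price, inventory, capital):
    """Execute one trading action at the day's open price.
    Action 0 = hold, 1/2/3 = buy 300/600/900 shares, 4 = sell everything."""
    reward = 0.0
    buy_volumes = {1: 300, 2: 600, 3: 900}

    if action in buy_volumes and capital >= price * buy_volumes[action]:
        volume = buy_volumes[action]
        inventory.append((price, volume))            # remember purchase price and volume
        capital -= price * volume

    elif action == 4 and inventory:                  # sell the entire inventory at once
        cost = sum(bp * bv for bp, bv in inventory)  # sum of purchase price * volume (BV_i)
        volume = sum(bv for _, bv in inventory)
        capital += price * volume
        reward = price * volume - cost               # realized profit (or loss) drives the reward
        inventory.clear()

    return reward, inventory, capital
```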

As mentioned, in this research we apply the Huber loss to our model to estimate the output, which is the Q value. The Huber loss is a combination of the Mean Absolute Error (MAE) and the Mean Squared Error (MSE).

The Huber loss function combines desirable characteristics of both the MAE and MSE loss functions: it penalizes small errors quadratically, as MSE does, while limiting the impact of outliers on the training process, as MAE does. This makes the Huber loss particularly suitable for tasks involving volatile or noisy data, as it balances robustness to outliers with sensitivity to small errors. Mathematically, the Huber loss function is defined as:

$$L_{\delta}(x) = \begin{cases} \tfrac{1}{2}x^{2}, & |x| \le \delta \\ \delta\left(|x| - \tfrac{1}{2}\delta\right), & |x| > \delta \end{cases}$$

where:

• x is the error between the actual value and the predicted value
• δ is a fixed threshold, usually chosen based on the characteristics of the data.
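For reference, the piecewise definition above can be written directly, and Keras provides the same loss out of the box (the δ = 1.0 threshold is only a placeholder):

```python
import numpy as np
import tensorflow as tf

def huber(x, delta=1.0):
    """Piecewise Huber loss: quadratic for |x| <= delta, linear beyond it."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= delta,
                    0.5 * x ** 2,
                    delta * (np.abs(x) - 0.5 * delta))

# Keras ships the same loss; the threshold is configurable.
keras_huber = tf.keras.losses.Huber(delta=1.0)
```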

Neural Network architecture

To identify the optimal neural network structure for our input data, we explored various network architectures from prior research, including LSTM, 2-D convolution, and dense layers. Drawing on these findings, we decided to use a 1-dimensional Convolutional Neural Network (CNN). A CNN, a variant of the MLP, features Conv1D layers, which perform one-dimensional convolution on the input data, sliding a filter across it to identify patterns and extract important features. This architecture is particularly well suited to time series data such as the stock prices we analyze in this project.

Our neural network structure includes hidden layers containing convolutional layers, max pooling, flatten, dropout, and dense layers as the last hidden layers. The activation function we apply in the neural network is the ReLU function, with the formula:

$$\mathrm{ReLU}(x) = \max(0, x)$$

ReLU is computationally simple, returning the larger of 0 and its input, which speeds up neural network training compared to more complex activation functions such as Sigmoid or Tanh. The model can still learn effectively because ReLU does not suffer from the vanishing gradient problem during backpropagation.

Convolutional layers are the core building blocks of a CNN. The weights of a convolution layer can be seen as 1-D filters, and the layer applies the convolution operation with these filters. In each convolution layer we set a different number of neurons and a different number of filters. This helps the model classify the features in the input data while reducing learning and training time, still providing good enough performance and avoiding overfitting.

Dropout is a commonly used method in neural network research and applications. During the training process, some layer outputs are ignored, or "dropped out". By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections [12]. Randomly setting a fraction of input units to zero prevents the network layers from jointly adapting to correct errors from previous layers, encouraging the network to learn a sparse representation as a side effect and thereby making the model more robust.

We set half (50%) of the input units of the first hidden layer to be randomly "turned off" (set to 0) during training.

After the convolution process creates feature maps, a max pooling layer is used to reduce their size and ease the model's computation. We adjust the pooling layer size to match the previously initialized state, helping to optimize performance.

The output of the max pooling layer is fed into a layer that flattens it into a 1-D array, a necessary step before passing the output to the fully connected layers. The 1-dimensional arrays are then fed into the remaining dense layers, which compute and return the estimated Q-values. This also helps control model complexity.
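A hedged Keras sketch of such a 1-D CNN Q-network follows; the filter counts, kernel sizes, and dense units are illustrative, since the report's exact layer sizes are not listed here:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_q_network(state_size=11, action_size=5):
    """Conv1D -> Dropout -> Conv1D -> MaxPooling1D -> Flatten -> Dense Q-head.
    The 11-feature state is reshaped to (state_size, 1) before being fed in."""
    model = tf.keras.Sequential([
        layers.Input(shape=(state_size, 1)),
        layers.Conv1D(32, kernel_size=3, activation="relu"),
        layers.Dropout(0.5),                              # 50% of units randomly dropped while training
        layers.Conv1D(64, kernel_size=3, activation="relu"),
        layers.MaxPooling1D(pool_size=2),                 # shrink feature maps before flattening
        layers.Flatten(),
        layers.Dense(32, activation="relu"),
        layers.Dense(action_size, activation="linear"),   # one Q-value per action
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss=tf.keras.losses.Huber())
    return model
```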

To build the model for the trading bot, we collect data on the 30 large-capitalization stock codes in the VN30 index. We use stock data of these 30 companies from January 1, 2014 to August 4, 2024. The trading bot is trained on the training set from January 1, 2014 to December 31, 2022, and tested on the test set from January 1, 2023 to August 4, 2024.

Table 3: Training and Testing Set
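A minimal sketch of the date-based split, assuming the VN30 price data sits in a CSV with a `date` column (the file and column names are assumptions):

```python
import pandas as pd

# Hypothetical file; dates follow the split described above.
df = pd.read_csv("vn30_prices.csv", parse_dates=["date"])

train = df[(df["date"] >= "2014-01-01") & (df["date"] <= "2022-12-31")]
test = df[(df["date"] >= "2023-01-01") & (df["date"] <= "2024-08-04")]
```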

The VN30 index was officially established by the Ho Chi Minh City Stock Exchange (HOSE) and came into use on February 6, 2012. The VN30 index comprises 30 stock codes listed on HOSE; these are the stocks with the highest market capitalization and best liquidity on the Vietnamese stock market. In practice, VN30 is a group of stocks of large companies with sustainable growth potential and stable business operations. Fluctuations of the VN30 index also represent the price fluctuations of leading businesses in the market. Companies in the VN30 basket often hold leading positions in their respective fields, because if their business operations are not effective they will be replaced by other, more promising companies. Investors rely on the index to evaluate market trends and to choose investments in appropriate industries and fields.

Implementation tools

RESULTS & DISCUSSION

After training on the training set, we use the models to evaluate profit and loss on the test set. With an initial capital of 100,000,000 VND for each stock code, trading from January 1, 2023 to March 15, 2023 yields the profits shown below:

Profit (VND)                                     DQN with fixed Q-target   DQN          Double DQN

Joint Stock Commercial Bank for Investment and
Commercial Bank for Industry and Trade
Commercial Joint Stock Bank (STB)                34.005.000                20.565.000   29.430.000
Commercial Joint Stock Bank (TCB)                35.730.000                45.405.000   50.850.000
23  Bank for Foreign Trade of
Commercial Joint Stock Bank (VIB)                29.480.400                31.609.200   27.338.400
28  Viet Nam Dairy Products Joint

Table 4: Profits of the trading bot

Figure 16: Results of the 3 deep reinforcement learning algorithms

The bot does a good job of buying at low prices and selling at high prices.

Figure 17: The bot performs operations on Techcombank (TCB) stocks

The bot's trading results are quite good, with the average profit of each model shown below:

DQN DQN with fixed Q-target Double DQN

Table 5: Mean profit of 3 DRL algorithms

The code supporting the presented results is publicly available at the following link: https://github.com/doanan293/NCKH

In DQN with fixed Q-target, two separate networks (policy and target) are employed. The target network's fixed parameters reduce the correlation between target and predicted Q-values, mitigating the overestimation bias common in standard DQN. However, the resulting tendency to underestimate, while addressing the overestimation issue, can lead to missed decision opportunities in scenarios where immediate action would be beneficial.

Double DQN goes further than DQN with fixed Q-target: instead of letting a single network both select and evaluate the action, the policy network is used to choose the action with the highest Q-value, and the target network is then used to estimate the value of that action. This makes the Double DQN algorithm underestimate the Q value even more than DQN with fixed Q-target, and it can likewise miss actions when the right time comes.

Vietnam's stock market is highly volatile, with a huge and unstable number of states. The DQN with fixed Q-target algorithm therefore often misses appropriate actions when it encounters ideal price situations, and this happens even more with the Double DQN algorithm. In line with our initial prediction, for the Vietnamese stock market the Vanilla Deep Q-Network algorithm produces better results than the other two algorithms.

LIMITATION AND RECOMMENDATION

In the experiment, we only let the bot buy in volumes of 300, 600, or 900 shares; when the sell action is chosen, the bot sells its entire holding. We need to limit the number of actions the bot can choose from so that it does not get confused when making decisions, and because deep reinforcement learning training consumes a lot of computational resources. Therefore, running on non-specialized personal computers, we set only 5 types of actions.

To keep the trading bot manageable, we simplified the input state to the previous 10 days' prices and employed a sigmoid function to reduce state variability. However, optimizing the bot's actions and state representation requires extensive training, and our limited computational resources restricted us to 50 training episodes, potentially compromising the results. Future research should expand the bot's action options and refine the input state representation to capture stock price fluctuations more effectively. Improving the trading bot's action policies and exploring more efficient training methods also remain important directions.

ABBREVIATION (in alphabetical order)

DDPG  Deep Deterministic Policy Gradient

HOSE  Ho Chi Minh City Stock Exchange

LSTM  Long Short-Term Memory

NB    The number of bought shares

NCB   The number of shares the trading bot can buy

SP    The current stock price

REFERENCES

[1] Anfin, T. (2021). Nguồn gốc và sự hình thành của thị trường chứng khoán. https://www.anfin.vn/blog/lich-su-thi-truong-chung-khoan

[2] Awad, A. L., Elkaffas, S. M., & Fakhr, M. W. (2023). Stock Market Prediction Using Deep Reinforcement Learning. Applied System Innovation, 6(6), 106.

[3] Sun, S., Wang, R., & An, B. (2023). Reinforcement learning for quantitative trading. ACM Transactions on Intelligent Systems and Technology, 14(3), 1-29.

[4] Zhang, Z., Zohren, S., & Roberts, S. (2019). Deep reinforcement learning for trading. arXiv preprint arXiv:1911.10107.

[5] Conegundes, L., & Pereira, A. C. M. (2020, July). Beating the stock market with a deep reinforcement learning day trading system. In 2020 International Joint Conference on Neural Networks (IJCNN) (pp. 1-8). IEEE.

[6] Van Otterlo, M., & Wiering, M. (2012). Reinforcement learning and Markov decision processes. In Reinforcement Learning: State-of-the-Art (pp. 3-42). Berlin, Heidelberg: Springer Berlin Heidelberg.

[7] Pieters, M., & Wiering, M. A. (2016, December). Q-learning with experience replay in a dynamic environment. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI) (pp. 1-8). IEEE.

[8] Pan, J., Wang, X., Cheng, Y., & Yu, Q. (2018). Multisource transfer double DQN based on actor learning. IEEE Transactions on Neural Networks and Learning Systems, 29(6), 2227-2238.

[9] Kyurkchiev, N., & Markov, S. (2015). Sigmoid Functions: Some Approximation and Modelling Aspects: Some Moduli in Programming Environment MATHEMATICA. LAP Lambert Academic Publishing.

[10] Gokcesu, K., & Gokcesu, H. (2021). Generalized Huber loss for robust learning and its efficient minimization for a robust statistics. arXiv preprint arXiv:2108.12627.

[11] Gudelek, M. U., Boluk, S. A., & Ozbayoglu, A. M. (2017, November). A deep learning based stock trading model with 2-D CNN trend detection. In 2017 IEEE Symposium Series on Computational Intelligence (SSCI) (pp. 1-8). IEEE.

[12] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958.

[13] Oliphant, T. E. (2006). Guide to NumPy (Vol. 1, p. 85). USA: Trelgol Publishing.

[14] Ketkar, N. (2017). Introduction to Keras. In Deep Learning with Python: A Hands-on Introduction (pp. 97-111).

[15] Pang, B., Nijkamp, E., & Wu, Y. N. (2020). Deep learning with TensorFlow: A review. Journal of Educational and Behavioral Statistics, 45(2), 227-248.

[16] da Costa-Luis, C. O. (2019). tqdm: A fast, extensible progress meter for Python and CLI. Journal of Open Source Software, 4(37), 1277.

[17] Smelter, A., & Moseley, H. N. (2018). A Python library for FAIRer access and deposition to the Metabolomics Workbench Data Repository. Metabolomics, 14, 1-8.

[18] Choi, J. W., Byun, J. M., & Seol, S. J. (2011). Generation of pseudo porosity logs from seismic data using a polynomial neural network method. Journal of the Korean Earth Science Society, 32(6), 665-673.

[19] VanderPlas, J., Granger, B., Heer, J., Moritz, D., Wongsuphasawat, K., Satyanarayan, A., ... & Sievert, S. (2018). Altair: Interactive statistical visualizations for Python. Journal of Open Source Software, 3(32), 1057.
