HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION
GRADUATION THESIS
INFORMATION TECHNOLOGY
APPLICATION FOR COMPREHENSIVE INVESTMENT, MARKET INSIGHTS AND TRADING STRATEGY USING MACHINE LEARNING
LECTURER: TRẦN NHẬT QUANG, PhD
STUDENTS: BUI DUC NHAN, BUI HUU LUAN
SKL011636
FACULTY FOR HIGH QUALITY TRAINING
MAJOR: INFORMATION TECHNOLOGY
Advisor: TRẦN NHẬT QUANG, PhD
Ho Chi Minh City, July 2023
ACKNOWLEDGEMENT
First and foremost, I would like to express my heartfelt gratitude to Professor Tran Nhat Quang. Throughout the process of completing my thesis, you have guided me with utmost dedication and sincerity. Not only have you helped me consolidate the fundamental knowledge I have learned, but you have also expanded upon advanced topics in a profound manner. You have provided me with a comprehensive perspective, helping me broaden my thinking to explore various interesting and innovative aspects that I had never considered before. With your profound knowledge, you have imparted practical wisdom acquired from life, teaching, and work, from which I have learned many fascinating things.
I am grateful to my family, who have been a solid foundation for me throughout the time I have been working on this thesis.
Last but not least, I would like to express my gratitude to the teachers who taught me in previous courses. They were the first to help me build my knowledge to where it is today. These teachers are like mirrors reflecting successful lives, inspiring me to strive and hope for a bright future. I also want to thank the high-quality Department of Computer Science for providing the conditions for me and my peers to have a memorable learning experience.
Ho Chi Minh City, July 2023
Members of group:
Table of Contents
ACKNOWLEDGEMENT
ABSTRACT
CHAPTER 1: INTRODUCTION
CHAPTER 2: LITERATURE REVIEW AND THEORY BASIS
2.1 Literature review
2.2 Theory Basis
2.2.1 Stock Indicators
2.2.2 LSTM
2.2.3 Support Vector Machine
2.2.4 Random Forest
2.2.5 XGBoost
2.2.6 Natural Language Processing (NLP)
CHAPTER 3: METHODOLOGY
3.1 Data Collection
3.1.1 Historical stocks data
3.1.2 Stocks news and embedding data
3.1.3 Data preprocessing
3.2 Time series data set and labeling
3.3 Pre-trained models and ensemble approach
3.3.1 Support Vector Machine
3.3.2 Random Forest
3.3.3 eXtreme Gradient Boosting
3.3.4 Long-Short Term Memory
3.3.5 Ensemble Model
CHAPTER 4: TRAINING
4.1 Loss Functions and evaluation metrics
4.1.1 Binary Cross Entropy
4.1.2 Accuracy
4.2 Fine tuning and grid search
4.2.1 Grid Search
4.2.2 Base model validation accuracy
4.2.3 Ensemble validation
CHAPTER 5: EVALUATION
5.1 Apple Inc (AAPL) evaluation
5.2 Amazon Inc (AMZN) evaluation
5.3 Alphabet Inc Class A (GOOGL) evaluation
5.4 Microsoft Corp (MSFT) evaluation
5.5 Tesla Inc (TSLA) evaluation
5.6 Ensemble model evaluation
5.7 Comparing textual versus non-textual input models
CHAPTER 6: SYSTEM ARCHITECTURE AND TECHNOLOGY
6.1 Purpose of project
6.2 Application’s interface and features
6.3 System Architecture
6.3.1 Library
6.3.2 Technologies
6.3.2.1 Flask
6.3.2.2 TensorFlow
6.3.2.3 Torch
CHAPTER 7: CONCLUSION
REFERENCES
Table of Figures
Figure 1: LSTM structure
Figure 2: SVM Classifier
Figure 3: Random Forest Simplified
Figure 4: The skip-gram model
Figure 5: Transformer structure
Figure 6: Overview of BERT
Figure 7: Dataset train validation test split
Figure 8: Time Series Data
Figure 9: The process of the Random Forest algorithm
Figure 10: The process of bootstrap aggregating (Bagging)
Figure 11: Ensemble Model’s Diagram
Figure 12: The bar chart of balanced accuracy of Apple
Figure 13: The bar chart of test accuracy of Apple for data type 0
Figure 14: The bar chart of test accuracy of Apple for data type 1
Figure 15: The bar chart of test accuracy of Apple for data type 2
Figure 16: The bar chart of balanced accuracy of Amazon
Figure 17: The bar chart of test accuracy of Amazon for data type 0
Figure 18: The bar chart of test accuracy of Amazon for data type 1
Figure 19: The bar chart of test accuracy of Amazon for data type 2
Figure 20: The bar chart of balanced accuracy of Google
Figure 21: The bar chart of test accuracy of Google for data type 0
Figure 22: The bar chart of test accuracy of Google for data type 1
Figure 23: The bar chart of test accuracy of Google for data type 2
Figure 24: The bar chart of balanced accuracy of Microsoft
Figure 25: The bar chart of test accuracy of Microsoft for data type 0
Figure 26: The bar chart of test accuracy of Microsoft for data type 1
Figure 27: The bar chart of test accuracy of Microsoft for data type 2
Figure 28: The bar chart of balanced accuracy of Tesla
Figure 29: The bar chart of test accuracy of Tesla for data type 0
Figure 30: The bar chart of test accuracy of Tesla for data type 1
Figure 31: The bar chart of test accuracy of Tesla for data type 2
Figure 32: Accuracy of ensemble for Apple
Figure 33: Accuracy of ensemble for Amazon
Figure 34: Accuracy of ensemble for Google
Figure 35: Accuracy of ensemble for Microsoft
Figure 36: Accuracy of ensemble for Tesla
Figure 37: Application’s chart and indicators
Figure 38: Application’s chart and indicators
Figure 39: Technical of the stock
Figure 40: Prediction using machine learning models
Figure 41: System Architecture of the website
Table of Tables
Table 1: Signals of the indicators
Table 2: Indicator types
Table 3: SVM parameters
Table 4: Random Forest parameters
Table 5: XGBoost parameters
Table 6: LSTM parameters
Table 7: Data type table
Table 8: SVM grid search ranges
Table 9: Random Forest grid search ranges
Table 10: XGBoost grid search ranges
Table 11: LSTM grid search ranges
Table 12: The result of validation accuracy of Apple
Table 13: The result of validation accuracy of Amazon
Table 14: The result of validation accuracy of Google
Table 15: The result of validation accuracy of Microsoft
Table 16: The result of validation accuracy of Tesla
Table 17: Data type table
Table 18: Ensemble evaluation table
Table 19: Front end stacks
Table 20: Back-end stacks
ABSTRACT
Stock price prediction is a complex problem influenced by diverse political and economic factors. This study presents a novel stock market prediction architecture that integrates these factors through multiple data pre-processing schemes and ensemble learning techniques, resulting in improved prediction accuracy. Both numerical and textual data formats are utilized as inputs for ensemble regressors and classifiers to learn features. The trained results are concatenated and fed into a deep learning layer to predict the direction of closing prices. Empirical results based on news and historical data from five prominent companies (Apple Inc., Microsoft Corporation, Alphabet Inc., Amazon.com, Inc., and Tesla Inc.) demonstrate the effectiveness of the proposed prediction model.
Keywords: ensemble learning, natural language processing, technical analysis, time series analysis, stock forecasting
CHAPTER 1: INTRODUCTION
The stock market is volatile and difficult to predict. Prices fluctuate based on various factors, including micro- and macroeconomic indicators, investor sentiment, and political events. With the goal of building an application that can analyze stocks and provide deep insights into their value, we aim to give investors a versatile tool that offers a multidimensional perspective on the value of stocks.
To analyze the market and make predictions about the direction of stock prices, we have developed an application that relies on microeconomic factors, such as stock prices, stock indices, and financial analysis. By studying how the market operates and develops based on these factors, we have developed a machine learning application that utilizes various models to predict stock trends.
We focus on using fundamental data such as high, low, and close prices and trading volume to calculate indicators and make predictions about future stock prices. By analyzing these data patterns and applying data processing techniques, we generate diverse indicators and features that provide investors with multifaceted insights into stock value.
By integrating indicators derived from this fundamental data, we have built a machine learning model to predict stock price trends. The use of machine learning and predictive models has helped us gain a better understanding of how the stock market operates and develops based on the mentioned factors.
Deep learning and machine learning algorithms, including LSTM, Random Forest, Support Vector Machine, and the ensemble algorithm XGBoost, as well as a combination of these models with neural networks, have been employed in practical experiments. Furthermore, we have collected financial news relevant to the stocks and applied it as a feature to enhance our model. The resulting ensemble combines pretrained models: multiple NLP models for text preprocessing and feature extraction, together with Random Forest, Support Vector Machine, XGBoost, and LSTM models that use the news data as textual features.
To study whether stock market trends are predictable, and in particular whether they can be predicted from financial news, we have compared and evaluated a variety of settings for our models and their performance. Finally, we have built an application that puts our models to use and gives insights into the stocks we studied.
CHAPTER 2: LITERATURE REVIEW AND THEORY BASIS
2.1 Literature review
2.1.1 Factors affecting share prices: A literature revisit [1]
Various factors influence stock market movements, making it a complex and dynamic system. These factors can be broadly categorized into internal and external factors. Internal factors include company-specific elements such as financial performance, management decisions, and ownership structure. External factors encompass macroeconomic indicators, market sentiment, political events, and global economic conditions. The interplay of these factors, along with investor behavior and market psychology, contributes to the volatility and unpredictability of stock movements. Understanding and analyzing these multifaceted factors is essential for comprehending and predicting stock market trends.
2.1.2 A novel ensemble deep learning model for stock prediction based on stock prices and news [2]
The significant contribution of this paper is the proposed model, which uses multiple models that complement each other. Different models can learn different characteristics of the data, which can also reduce noise. Moreover, the paper opens the door to more sophisticated ensemble deep learning models and to the use of more complex data sources for stock prediction in the future.
2.1.3 Stock Price Forecasting via Sentiment Analysis on Twitter [3]
The study conducted by John Kordonis et al. highlights the potential of sentiment analysis on social media platforms, specifically Twitter, as a means to forecast stock market behavior. By employing machine learning techniques and analyzing the sentiment of aggregated tweets, the authors observed correlations between tweet sentiment and subsequent stock price movements. This research contributes to the evolving field of stock price forecasting and lays the foundation for further investigations into sentiment-based prediction models.
2.1.4 Deep learning for stock prediction using numerical and textual information [4]
This paper presents a novel approach to stock price prediction that leverages distributed representations of news articles and considers the correlations among related companies. By utilizing a recurrent network, the proposed methodology captures the temporal influence of time series data on stock price fluctuations. The experimental results showcase the potential of this approach in improving the accuracy of stock price forecasting, contributing to advancements in financial market analysis and decision-making processes.
2.1.5 DP-LSTM: Differential Privacy-inspired LSTM for Stock Prediction Using Financial News [5]
This paper presents an integrated approach that combines deep neural networks with NLP models, specifically VADER, to extract opinions from text data for stock price prediction. The proposed DP-LSTM network, incorporating differential privacy techniques, showcases robust performance in accurately predicting stock prices, especially for the S&P 500 index. These findings contribute to the advancement of sentiment-based prediction models and provide valuable insights for investors and financial analysts in managing investment risks.
2.1.6 On the Properties of Neural Machine Translation: Encoder–Decoder Approaches [6]
This paper focuses on the evaluation of different encoder choices within the encoder-decoder architecture, including recurrent neural networks (RNNs) with gated hidden units and gated recursive convolutional neural networks (grConv). Previous studies have shown that RNN-based models suffer from performance degradation with longer sentences, leading to the exploration of alternative encoders like grConv. While both RNN and grConv models face challenges in handling longer sentences, qualitative analysis demonstrates their ability to generate accurate translations. Future research directions include scaling up training procedures, improving performance with long sentences, and exploring diverse neural architectures. The ability of grConv to mimic grammatical structure suggests its potential applicability in other natural language processing tasks beyond machine translation.
2.1.7 Recurrent neural networks and robust time series prediction [7]
This paper introduces a robust learning algorithm for recurrent neural networks (RNNs). The algorithm filters outliers from both the target function and input data, improving parameter estimation. Comparisons on synthetic and real data demonstrate the advantages of the proposed approach over conventional methods. Filtering the Puget Power Electric Demand time series removes outliers and enhances prediction accuracy compared to unfiltered data.
2.1.8 VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text [8]
This paper presents the creation and evaluation of VADER (Valence Aware Dictionary for Sentiment Reasoning), a sentiment lexicon specifically designed for microblog-like contexts. The authors employ a combination of qualitative and quantitative methods to develop a reliable gold-standard sentiment lexicon. They then enhance the lexicon by incorporating five generalizable rules that capture grammatical and syntactical conventions used to express sentiment intensity. The study demonstrates that incorporating these heuristics improves the accuracy of sentiment analysis in various domain contexts.
2.1.9 Improvement methods for stock market prediction using financial news articles [9]
In this study, the researchers investigated the correlation between financial news and stock prices, aiming to establish a relationship between the two. The study involved collecting financial news data and corresponding stock prices, followed by a rigorous experimental analysis. The results demonstrated a significant correlation, achieving a high accuracy rate of 73%. Additionally, the researchers enhanced the system's performance by removing weak stock tickers in the VN30 index using technical indicators, leading to further improvements. However, the authors acknowledged that the success ratio could be enhanced by analyzing news from more reliable sources. Looking ahead, future work will focus on combining stock price prediction with technical analysis to further enhance the system's performance.
2.1.10 Stock trend prediction using simple moving average supported by news classification [10]
This paper uses an artificial neural network to combine two aspects: the simple moving average technique and news classification. The experiment uses approximately one year's worth of stock data and financial news. The results indicate that financial news can improve the responsiveness of the prediction.
To put it another way, bull markets typically outlast bear markets.
Readings above -20 for the 14-day Williams %R would indicate that the underlying security was trading near the top of its 14-day high-low range. Readings below -80 occur when a security is trading at the low end of its high-low range. Default settings use -20 as the overbought threshold and -80 as the oversold threshold. These levels can be adjusted depending on the security’s characteristics.
$$\%R = \frac{\text{Highest High} - \text{Close}}{\text{Highest High} - \text{Lowest Low}} \times (-100)$$
Where:
- Highest High: Highest High in the lookback period, typically 14 days
- Close: Most recent Close price
- Lowest Low: Lowest Low in the lookback period, typically 14 days
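As a minimal sketch of the definition above, the following Python functions compute Williams %R over plain lists of prices and label a reading with the default -20 and -80 thresholds. The names `williams_r` and `williams_signal`, and the choice to return `None` until a full lookback window is available, are illustrative assumptions and not the code of the application described in this thesis.

```python
def williams_r(highs, lows, closes, period=14):
    """Williams %R per bar; None until `period` bars are available."""
    values = []
    for i in range(len(closes)):
        if i + 1 < period:
            values.append(None)
            continue
        highest_high = max(highs[i - period + 1 : i + 1])
        lowest_low = min(lows[i - period + 1 : i + 1])
        price_range = highest_high - lowest_low
        if price_range == 0:
            values.append(None)  # flat window: %R is undefined
        else:
            values.append((highest_high - closes[i]) / price_range * -100)
    return values


def williams_signal(value, overbought=-20, oversold=-80):
    """Label a reading using the default thresholds from the text."""
    if value is None:
        return "neutral"
    if value > overbought:
        return "overbought"
    if value < oversold:
        return "oversold"
    return "neutral"
```

The thresholds are keyword arguments because the text notes that they can be adjusted to the characteristics of the security.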
StochRSI [13]
The Stochastic Oscillator (STOCH) is a range-bound momentum oscillator. The Stochastic indicator is designed to display the location of the close compared to the high/low range over a user-defined number of periods. Typically, the Stochastic Oscillator is used for three things: identifying overbought and oversold levels, spotting divergences, and identifying bull and bear setups or signals.
A StochRSI reading above 0.8 is considered overbought, while a reading below 0.2 is considered oversold. On the zero to 100 scale, above 80 is overbought, and below 20 is oversold.
Overbought doesn't necessarily mean the price will reverse lower, just like oversold doesn't mean the price will reverse higher. Rather, the overbought and oversold conditions simply alert traders that the RSI is near the extremes of its recent readings.
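The text above gives the thresholds but not the calculation. A common definition, assumed here, rescales the current RSI value within its own high-low range over the lookback window: StochRSI = (RSI - lowest RSI) / (highest RSI - lowest RSI). The sketch below takes an already computed RSI series as input; `stoch_rsi` and `stoch_rsi_signal` are hypothetical helper names.

```python
def stoch_rsi(rsi_values, period=14):
    """Rescale each RSI reading to [0, 1] within its lookback window."""
    out = []
    for i in range(len(rsi_values)):
        if i + 1 < period:
            out.append(None)
            continue
        window = rsi_values[i - period + 1 : i + 1]
        lowest, highest = min(window), max(window)
        if highest == lowest:
            out.append(None)  # RSI did not move: ratio is undefined
        else:
            out.append((rsi_values[i] - lowest) / (highest - lowest))
    return out


def stoch_rsi_signal(value, overbought=0.8, oversold=0.2):
    """Apply the 0.8 / 0.2 thresholds quoted in the text."""
    if value is None:
        return "neutral"
    if value > overbought:
        return "overbought"
    if value < oversold:
        return "oversold"
    return "neutral"
```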
Commodity Channel Index [14]
The Commodity Channel Index (CCI) is a momentum oscillator used in technical analysis primarily to identify overbought and oversold levels by measuring an instrument's variations away from its statistical mean. CCI is a well-known and widely used indicator that owes its popularity in no small part to its versatility. Besides overbought/oversold levels, CCI is often used to find reversals as well as divergences. Originally, the indicator was designed to identify trends in commodities; however, it is now used for a wide range of financial instruments.
When the CCI moves from negative or near-zero territory to above 100, that may indicate the price is starting a new uptrend. Once this occurs, traders can watch for a pullback in price followed by a rally in both price and the CCI to signal a buying opportunity.
The same concept applies to an emerging downtrend. When the indicator goes from positive or near-zero readings to below -100, then a downtrend may be starting. This is a signal to get out of longs or to start watching for shorting opportunities.
$$\text{CCI} = \frac{\text{Typical Price} - \text{Moving Average}}{0.015 \times \text{Mean Deviation}}$$
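A possible implementation of the CCI is sketched below. It assumes the usual conventions, which the text does not spell out: the typical price is the average of high, low and close, the moving average is a simple moving average of the typical price, and the mean deviation is the mean absolute deviation from that average over the same window. The function name `cci` and the default period of 20 are illustrative choices.

```python
def cci(highs, lows, closes, period=20):
    """Commodity Channel Index per bar; None until the window is full."""
    typical = [(h + l + c) / 3 for h, l, c in zip(highs, lows, closes)]
    out = []
    for i in range(len(typical)):
        if i + 1 < period:
            out.append(None)
            continue
        window = typical[i - period + 1 : i + 1]
        mean = sum(window) / period
        mean_dev = sum(abs(tp - mean) for tp in window) / period
        if mean_dev == 0:
            out.append(None)  # no variation in the window
        else:
            out.append((typical[i] - mean) / (0.015 * mean_dev))
    return out
```

A move of the returned value from near zero to above 100, or to below -100, corresponds to the possible start of an uptrend or downtrend described above.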
Moving Average Convergence/Divergence [15]
MACD is an extremely popular indicator used in technical analysis. MACD can be used to identify aspects of a security's overall trend. Most notably, these aspects are momentum, as well as trend direction and duration. What makes MACD so informative is that it is actually the combination of two different types of indicators. First, MACD employs two Moving Averages of varying lengths (which are lagging indicators) to identify trend direction and duration. Then, MACD takes the difference in values between those two Moving Averages (the MACD Line) and an EMA of that difference (the Signal Line), and plots the difference between the two lines as a histogram that oscillates above and below a center Zero Line. The histogram is used as a good indication of a security's momentum.
MACD crossing above zero is considered bullish, while crossing below zero is bearish. Secondly, when MACD turns up from below zero it is considered bullish. When it turns down from above zero it is considered bearish.
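As a rough illustration of the construction described above, the sketch below builds the MACD line, the signal line and the histogram from a list of closing prices. The 12, 26 and 9 period defaults are the common settings. Seeding each EMA with the first value of its input is a simplifying assumption (other seeding schemes exist), and the helper names `ema` and `macd` are hypothetical.

```python
def ema(values, period):
    """Exponential moving average seeded with the first value (an assumption)."""
    alpha = 2 / (period + 1)
    out = [values[0]]
    for value in values[1:]:
        out.append(alpha * value + (1 - alpha) * out[-1])
    return out


def macd(closes, fast=12, slow=26, signal=9):
    """Return (macd_line, signal_line, histogram) as equal-length lists."""
    fast_ema = ema(closes, fast)
    slow_ema = ema(closes, slow)
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]
    signal_line = ema(macd_line, signal)
    histogram = [m - s for m, s in zip(macd_line, signal_line)]
    return macd_line, signal_line, histogram
```

For a steadily rising price series the fast EMA stays above the slow EMA, so the MACD line is positive, which matches the bullish reading of values above zero.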
When the MACD line crosses from below to above the signal line, the indicator is considered bullish. The further below the zero line, the stronger the signal. When the MACD line crosses from above to below the signal line, the indicator is considered bearish. The further above the zero line, the stronger the signal.
$$\text{MACD Line} = \text{EMA}_{12} - \text{EMA}_{26}$$
$$\text{Signal Line} = \text{EMA}_{9}(\text{MACD Line})$$
Directional Movement [16]
Directional Movement (DMI) is actually a collection of three separate indicators combined into one. Directional Movement consists of the Average Directional Index (ADX), Plus Directional Indicator (+DI) and Minus Directional Indicator (-DI). ADX's purpose is to define whether or not there is a trend present. It does not take direction into account at all. The other two indicators (+DI and -DI) are used to complement the ADX. They serve the purpose of determining trend direction. By combining all three, a technical analyst has a way of determining and measuring a trend's strength as well as its direction.
Crossovers are the main trade signals. A long trade is taken when the +DI crosses above the -DI and an uptrend could be underway. Meanwhile, a sell signal occurs when the +DI instead crosses below the -DI. In such cases, a short trade may be initiated because a downtrend might be underway.
The indicator can also be used as a trend or trade confirmation tool. If the +DI is well above -DI, the trend has strength on the upside, and this would help confirm current long trades or new long trade signals based on other entry methods. Conversely, if -DI is well above +DI, this confirms the strong downtrend or short positions.
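The crossover rule above can be sketched as follows. The calculation of +DI and -DI themselves is not reproduced here; the function assumes both series have already been computed and only detects where one crosses the other. The name `di_crossovers` and the `(index, side)` output format are illustrative assumptions.

```python
def di_crossovers(plus_di, minus_di):
    """Return (bar index, 'long' or 'short') for each +DI / -DI crossover."""
    signals = []
    for i in range(1, len(plus_di)):
        previous = plus_di[i - 1] - minus_di[i - 1]
        current = plus_di[i] - minus_di[i]
        if previous <= 0 < current:
            signals.append((i, "long"))   # +DI crossed above -DI
        elif previous >= 0 > current:
            signals.append((i, "short"))  # +DI crossed below -DI
    return signals
```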
If the Forecast Oscillator stays above the zero line for an extended period, then it signals that the price may rise in the future, and if it stays below the zero line for an extended period, then it signals a coming fall in the security price.
Chande Momentum Oscillator [18]
The Chande momentum oscillator is a technical momentum indicator introduced by Tushar Chande in his 1994 book The New Technical Trader. The formula calculates the difference between the sum of recent gains and the sum of recent losses and then divides the result by the sum of all price movements over the same period.
The Chande oscillator is similar to other momentum indicators such as Wilder’s relative strength index (RSI) and the stochastic oscillator. It measures momentum on both up and down days and does not smooth results, triggering more frequent oversold and overbought penetrations. The indicator oscillates between +100 and -100.
A security is deemed to be overbought when the Chande momentum oscillator is above +50 and oversold when it is below -50. Many technical traders add a 10-period moving average to this oscillator to act as a signal line. The oscillator generates a bullish signal when it crosses above the moving average and a bearish signal when it drops below the moving average.
The oscillator can be used as a confirmation signal when it crosses above or below the 0 line. For example, if the 50-day moving average crosses above the 200-day moving average (golden cross), a buy signal is confirmed when the Chande momentum oscillator crosses above 0, predicting prices are headed higher.
$$\text{CMO} = \frac{S_u - S_d}{S_u + S_d} \times 100$$
Where:
- $S_u$ = Sum of the difference between the current close and previous close on up days for the specified period. Up days are days when the current close is greater than the previous close.
- $S_d$ = Sum of the absolute value of the difference between the current close and the previous close on down days for the specified period. Down days are days when the current close is less than the previous close.
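The definition above translates almost directly into code. The sketch below sums the gains on up days and the absolute losses on down days over the last `period` price changes. The function name `chande_momentum` and the default period of 9 are assumptions made for illustration.

```python
def chande_momentum(closes, period=9):
    """Chande momentum oscillator; None until `period` changes are available."""
    out = []
    for i in range(len(closes)):
        if i < period:
            out.append(None)
            continue
        up = down = 0.0
        for j in range(i - period + 1, i + 1):
            change = closes[j] - closes[j - 1]
            if change > 0:
                up += change      # contributes to S_u
            else:
                down += -change   # contributes to S_d
        total = up + down
        out.append(None if total == 0 else (up - down) / total * 100)
    return out
```

Because the numerator can never exceed the denominator in absolute value, the result stays between -100 and +100, as stated in the text.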
Efficiency Ratio [19]
Efficiency ratios measure a company's ability to use its assets and manage its liabilities effectively in the current period or in the short term. Although there are several efficiency ratios, they are similar in that they measure the time it takes to generate cash or income from a client or by liquidating inventory.
Efficiency ratios include the inventory turnover ratio, asset turnover ratio, and receivables turnover ratio. These ratios measure how efficiently a company uses its assets to generate revenues and its ability to manage those assets. With any financial ratio, it's best to compare a company's ratio to its competitors in the same industry.
Efficiency ratios, also known as activity ratios, are used by analysts to measure a company's short-term or current performance. All these ratios use numbers in a company's current assets or current liabilities, quantifying the operations of the business.
An efficiency ratio measures a company's ability to use its assets to generate income. For example, an efficiency ratio often looks at various aspects of the company, such as the time it takes to collect cash from customers or the amount of time it takes to convert inventory to cash. This makes efficiency ratios important, because an improvement in the efficiency ratios usually translates to improved profitability.
$$\text{Efficiency Ratio} = \frac{\text{Non-Interest Expense}}{\text{Revenue}} \times 100$$
Rate of Change [20]
The rate of change (ROC) is the speed at which a variable changes over a specific period of time. ROC is often used when speaking about momentum, and it can generally be expressed as a ratio between a change in one variable relative to a corresponding change in another; graphically, the rate of change is represented by the slope of a line. The ROC is often illustrated by the Greek letter delta (Δ).
Rate of change is an extremely important financial concept because it allows investors to spot security momentum and other trends.
The ROC is plotted against a zero line that differentiates positive and negative values. Positive values indicate upward buying pressure or momentum, while negative values indicate selling pressure or downward momentum. Increasing values in either direction, positive or negative, indicate increasing momentum, and moves back toward zero indicate waning momentum.
Zero-line crossovers can be used to signal trend changes. Depending on the n value used, these signals may come early in a trend change (small n value) or very late in a trend change (larger n value). The ROC is prone to whipsaws, especially around the zero line. Therefore, this signal is generally not used for trading purposes, but rather to simply alert traders that a trend change may be underway.
Overbought and oversold levels are also used. These levels are not fixed but will vary by the asset being traded. Traders look to see what ROC values resulted in price reversals in the past. Often traders will find both positive and negative values where the price reversed with some regularity. When the ROC reaches these extreme readings again, traders will be on high alert and watch for the price to start reversing to confirm the ROC signal.
$$\text{ROC} = \frac{C_p - C_{p-n}}{C_{p-n}} \times 100$$
Where:
- $C_p$ = Closing price of most recent period
- $C_{p-n}$ = Closing price n periods before most recent period
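A minimal sketch of the ROC calculation is given below. The function name `rate_of_change` and the default of n = 12 are illustrative assumptions; the text only notes that small n values react early and large n values react late.

```python
def rate_of_change(closes, n=12):
    """Percentage change of each close versus the close n periods earlier."""
    out = []
    for i in range(len(closes)):
        if i < n or closes[i - n] == 0:
            out.append(None)  # not enough history, or division by zero
        else:
            out.append((closes[i] - closes[i - n]) / closes[i - n] * 100)
    return out
```

A sign change between consecutive values of the returned list corresponds to the zero-line crossover described above.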
The Schaff Trend Cycle [21]
The Schaff Trend Cycle (STC) is a charting indicator that is commonly used to identify market trends and provide buy and sell signals to traders. Developed in 1999 by noted currency trader Doug Schaff, STC is a type of oscillator and assumes that, regardless of time frame, currency trends accelerate and decelerate in cyclical patterns.
The Schaff Trend Cycle indicator is one of the most effective ways to let investors know about the market trend, overbought and oversold conditions, buying and selling positions, and ideal entry and exit points. The Schaff Trend Cycle indicator uses two thresholds of 25 and 75. If the indicator crosses the first threshold of 25, it generally indicates that the market is in an uptrend.
However, if the Schaff Trend Cycle indicator has breached the 75 level, it generally indicates the strengthening of the trend in either of the two directions (high or low). Furthermore, when the indicator’s straight line is above 75, it is a signal for an overbought condition. On the other hand, if the straight line is below 25, it signals oversold stocks.
$$\text{Schaff Trend Cycle} = 100 \times \frac{\text{MACD} - \%K(\text{MACD})}{\%D(\text{MACD}) - \%K(\text{MACD})}$$
The calculation is based on the following three inputs:
- The default period of the short-term exponential moving average is 23 days.
- The default period of the long-term exponential moving average is 50 days.
- Then calculate: $\text{MACD} = \text{EMA}_{23} - \text{EMA}_{50}$
- After that, calculate the 10-period Stochastic from the above MACD value:
$$\%K(\text{MACD}) = \%K_V(\text{MACD}, 10)$$
$$\%D(\text{MACD}) = \%D_V(\text{MACD}, 10)$$
Elder's Bulls Ray Index [22]
The Elder-Ray Index is a technical indicator developed by Dr. Alexander Elder that measures the amount of buying and selling pressure in a market. This indicator consists of two indicators known as "bull power" and "bear power," which are derived from a 13-period exponential moving average (EMA). These, along with the EMA, help traders determine the trend direction and isolate spots to enter and exit trades.
Technical traders will use the values of bull and bear power, along with divergence, to make trading decisions. Long positions are taken when the bear power has a value below zero but is increasing, and the bull power's latest peak is higher than it was previously (rising). A short position is taken when the bull power value is positive but falling, and the bear power's recent low is lower than it was previously (falling).
$$\text{Bull Power} = \text{High} - \text{13-period EMA}$$
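Following the description above, bull and bear power can be sketched as the distance of each bar's high and low from the 13-period EMA of the close. Defining bear power as the low minus the same EMA is the usual convention and is an assumption here, as are the first-value EMA seed and the helper names `ema` and `elder_ray`.

```python
def ema(values, period):
    """Exponential moving average seeded with the first value (an assumption)."""
    alpha = 2 / (period + 1)
    out = [values[0]]
    for value in values[1:]:
        out.append(alpha * value + (1 - alpha) * out[-1])
    return out


def elder_ray(highs, lows, closes, period=13):
    """Return (bull_power, bear_power) lists for each bar."""
    average = ema(closes, period)
    bull_power = [h - a for h, a in zip(highs, average)]
    bear_power = [l - a for l, a in zip(lows, average)]
    return bull_power, bear_power
```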
When the stock breaks through the upper band, some traders believe this generates a buy signal (breaking through a resistance level). When it breaks below the lower band, some traders believe this is a sell signal (breaking through a support level). According to this interpretation, if the stock price continues to rise, it could break above the upper part of the band to generate a buy signal. It’s worth noting that Bollinger believes a close either above the band or below the band is not necessarily a reversal signal, but rather a continuation pattern.
Simple Moving Average [24]
A simple moving average (SMA) is an arithmetic moving average calculated by adding recent prices and then dividing that figure by the number of time periods in the calculation. The simplest use of an SMA in technical analysis is to quickly determine whether an asset is in an uptrend or downtrend.
A simple moving average smooths out volatility and makes it easier to view the price trend of a security. If the simple moving average points up, the security's price is increasing. If it points down, the security's price is decreasing. The longer the time frame for the moving average, the smoother the simple moving average. A shorter-term moving average is more volatile, but its reading is closer to the source data.
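As a rough illustration of the arithmetic described above, the following sketch computes an SMA over a list of closing prices. The function name `simple_moving_average` and the sample prices are hypothetical and are not taken from the project code.

```python
def simple_moving_average(prices, period):
    """Return the SMA series; entries before a full window exists are None."""
    if period <= 0:
        raise ValueError("period must be positive")
    result = []
    window_sum = 0.0
    for i, price in enumerate(prices):
        window_sum += price
        if i >= period:
            window_sum -= prices[i - period]  # drop the price that left the window
        result.append(window_sum / period if i >= period - 1 else None)
    return result


closes = [10.0, 11.0, 12.0, 13.0, 14.0]  # hypothetical closing prices
print(simple_moving_average(closes, 3))
```

A longer `period` averages over more prices, which is why the resulting line is smoother but reacts later.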
Exponential Moving Average [25]
An exponential moving average (EMA) is a type of moving average (MA) that places a greater weight and significance on the most recent data points. The exponential moving average is also referred to as the exponentially weighted moving average. An exponentially weighted moving average reacts more significantly to recent price changes than a simple moving average (SMA), which applies an equal weight to all observations in the period.
The same rules that apply to the SMA also apply when interpreting the EMA. Keep in mind that the EMA is generally more sensitive to price movement. This can be a double-edged sword. On one side, it can help you identify trends earlier than an SMA would. On the flip side, the EMA will probably experience more short-term changes than a corresponding SMA.
Use the EMA to determine trend direction, and trade in that direction. When the EMA rises, you may want to consider buying when prices dip near or just below the EMA. When the EMA falls, you may consider selling when prices rally towards or just above the EMA.
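$$ EMA_T = F_{today} \times \frac{Smoothing}{1 + T} + EMA_{yesterday} \times \left(1 - \frac{Smoothing}{1 + T}\right) $$
Where:
- T = Number of prices in day range
- F = Closing price
- Smoothing = Smoothing factor (commonly 2)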
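The recursion in the EMA formula can be sketched as follows. This is a minimal illustration, assuming a smoothing factor of 2 and seeding the series with the first price (seeding with an SMA is another common choice); the function name is hypothetical.

```python
def exponential_moving_average(prices, period, smoothing=2.0):
    """EMA series seeded with the first price (an assumption of this sketch)."""
    if not prices:
        return []
    k = smoothing / (1 + period)  # weight given to the newest price
    ema = [prices[0]]
    for price in prices[1:]:
        # today's price times k, plus yesterday's EMA times (1 - k)
        ema.append(price * k + ema[-1] * (1 - k))
    return ema


print(exponential_moving_average([10.0, 20.0], period=3))
```

With `period=3` the weight `k` is 0.5, so the second value is the midpoint between the previous EMA and the new price.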
Volume Weighted Average Price [26]
The volume-weighted average price (VWAP) is a technical analysis indicator used on intraday charts that resets at the start of every new trading session. It is a trading benchmark that represents the average price a security has traded at throughout the day, based on both volume and price.
VWAP is used in different ways by traders. Traders may use VWAP as a trend confirmation tool and build trading rules around it. For instance, they may consider stocks with prices below VWAP as undervalued and those with prices above it as overvalued. If prices below VWAP move above it, traders may go long on the stock. If prices above VWAP move below it, they may sell their positions or initiate short positions.
Institutional buyers, including mutual funds, use VWAP to help move into or out of stocks with as small a market impact as possible. Therefore, when they can, institutions will try to buy below the VWAP or sell above it. This way their actions push the price back toward the average, instead of away from it.
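$$ VWAP = \frac{\sum (\text{Typical Price} \times \text{Volume})}{\sum \text{Volume}} $$
Where:
- Typical Price = (High + Low + Close) / 3
- The sums run over all periods since the start of the trading session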
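A minimal sketch of a running VWAP for one session is given here. It assumes the commonly used typical price (high + low + close) / 3 for each bar; the function name and the tuple layout are hypothetical.

```python
def vwap(bars):
    """Running VWAP for one session.

    bars: iterable of (high, low, close, volume) tuples in time order.
    Uses the typical price (H + L + C) / 3, an assumption of this sketch.
    """
    cum_pv = 0.0   # cumulative typical price * volume
    cum_vol = 0.0  # cumulative volume
    out = []
    for high, low, close, volume in bars:
        typical = (high + low + close) / 3.0
        cum_pv += typical * volume
        cum_vol += volume
        out.append(cum_pv / cum_vol if cum_vol else None)
    return out


session = [(12.0, 9.0, 9.0, 100), (24.0, 18.0, 18.0, 300)]  # hypothetical bars
print(vwap(session))
```

Because the sums restart with each call, passing one session's bars at a time reproduces the daily reset described above.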
Hull Moving Average [27]
The Hull Moving Average (HMA) is a way of calculating the moving average of an asset price that aims to reduce lag, making it more responsive to the current price while still smoothing out the curve on a chart.
As with other moving averages, swing traders and long-term traders can use the HMA as a directional trend indicator to identify or confirm price trends for a stock or other asset. As it is more responsive to price changes, the HMA line turns faster and more decisively on a chart than other moving averages, such as the simple moving average or exponential moving average, while maintaining a smooth curve on a price chart.
A longer-period HMA may be used to identify the trend. If the HMA is rising, the prevailing trend is rising, indicating it may be better to enter long positions. If the HMA is falling, the prevailing trend is also falling, indicating it may be better to enter short positions.
A shorter-period HMA may be used for entry signals in the direction of the prevailing trend. A long entry signal, when the prevailing trend is rising, occurs when the HMA turns up, and a short entry signal, when the prevailing trend is falling, occurs when the HMA turns down.
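$$ HMA = WMA\left(2 \times WMA\left(\frac{n}{2}\right) - WMA(n),\ \sqrt{n}\right) $$
Where the Weighted Moving Average is:
- $ WMA_T = \frac{2}{T \times (T + 1)} \sum_{i=0}^{T-1} F_i \times (T - i) $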
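The nested weighted averages in the HMA formula can be sketched as follows. This is an illustrative implementation, assuming that n/2 and the square root of n are rounded down to integers (charting packages differ on the rounding); the function names are hypothetical.

```python
import math


def wma(values, period):
    """Linearly weighted moving average; the newest value gets weight `period`."""
    denom = period * (period + 1) / 2.0
    out = []
    for end in range(len(values)):
        if end + 1 < period:
            out.append(None)  # not enough data for a full window yet
            continue
        window = values[end + 1 - period:end + 1]
        out.append(sum(w * v for w, v in zip(range(1, period + 1), window)) / denom)
    return out


def hma(values, period):
    """Hull moving average: WMA(2 * WMA(n/2) - WMA(n), sqrt(n))."""
    half = max(1, period // 2)
    root = max(1, int(math.sqrt(period)))
    fast = wma(values, half)
    slow = wma(values, period)
    # keep only positions where the slower average is defined
    diff = [2 * f - s for f, s in zip(fast, slow) if s is not None]
    smoothed = wma(diff, root)
    return [None] * (len(values) - len(smoothed)) + smoothed


series = [float(i) for i in range(10)]  # hypothetical, perfectly linear prices
print(hma(series, 4)[-1])
```

On a perfectly linear series the result coincides with the latest price, which is one way to see the reduced lag compared with a plain WMA of the same length.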
Chaikin Money Flow [28]
Developed by Marc Chaikin, Chaikin Money Flow (CMF) measures the amount of Money Flow Volume over a specific period. Money Flow Volume forms the basis for the Accumulation Distribution Line. Instead of a cumulative total, Chaikin Money Flow sums Money Flow Volume for a specific look-back period, typically 20 or 21 days. The resulting indicator fluctuates above and below the zero line just like an oscillator. Chartists weigh the balance of buying or selling pressure with the absolute level of Chaikin Money Flow. Additionally, chartists can look for crosses above or below the zero line to identify changes in money flow.
A CMF value above the zero line is a sign of strength in the market, and a value below the zero line is a sign of weakness in the market. Wait for the CMF to confirm the breakout direction of price action through trend lines or through support and resistance lines. For example, if a price breaks upward through resistance, wait for the CMF to have a positive value to confirm the breakout direction.
A CMF sell signal occurs when price action develops a higher high into overbought zones, with the CMF diverging with a lower high and beginning to fall. A CMF buy signal occurs when price action develops a lower low into oversold zones, with the CMF diverging with a higher low and beginning to rise.
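The summation described above can be sketched as follows. The text does not spell out the Money Flow Multiplier, so this sketch assumes the commonly published definition ((close - low) - (high - close)) / (high - low); the function name is hypothetical.

```python
def chaikin_money_flow(highs, lows, closes, volumes, period=20):
    """CMF = sum(Money Flow Volume) / sum(Volume) over the look-back period."""
    mfv = []
    for h, l, c, v in zip(highs, lows, closes, volumes):
        rng = h - l
        # Money Flow Multiplier lies in [-1, 1]; define it as 0 for a zero-range bar
        multiplier = ((c - l) - (h - c)) / rng if rng else 0.0
        mfv.append(multiplier * v)
    out = []
    for end in range(len(mfv)):
        if end + 1 < period:
            out.append(None)
            continue
        start = end + 1 - period
        vol_sum = sum(volumes[start:end + 1])
        out.append(sum(mfv[start:end + 1]) / vol_sum if vol_sum else 0.0)
    return out


# hypothetical bars: a close at the high, then a close at the low
print(chaikin_money_flow([10.0, 10.0], [8.0, 8.0], [10.0, 8.0], [100, 100], period=1))
```

A close at the top of the bar's range contributes the full volume as buying pressure, and a close at the bottom contributes it as selling pressure.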
The Ultimate Oscillator is a technical indicator that was developed by Larry Williams in 1976 to measure the price momentum of an asset across multiple timeframes. By using the weighted average of three different timeframes, the indicator has less volatility and fewer trade signals compared to other oscillators that rely on a single timeframe. Buy and sell signals are generated following divergences. The Ultimate Oscillator generates fewer divergence signals than other oscillators due to its multi-timeframe construction.
The Ultimate Oscillator is a range-bound indicator with a value that fluctuates between 0 and 100. Similar to the Relative Strength Index (RSI), levels below 30 are deemed to be oversold, and levels above 70 are deemed to be overbought. Trading signals are generated when the price moves in the opposite direction to the indicator and are based on a three-step method.
In order for the indicator to generate a buy signal, Williams recommended a three-step approach:
- First, a bullish divergence must form. This is when the price makes a lower low but the indicator is at a higher low.
- Second, the first low in the divergence (the lower one) must have been below 30. This means the divergence started from oversold territory and is more likely to result in an upside price reversal.
- Third, the Ultimate Oscillator must rise above the divergence high. The divergence high is the high point between the two lows of the divergence.
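For reference, the oscillator value that these three steps operate on can be sketched as follows. The calculation is not given in the text, so this assumes the commonly published definition with periods of 7, 14, and 28 and weights of 4, 2, and 1; the function name is hypothetical.

```python
def ultimate_oscillator(highs, lows, closes, periods=(7, 14, 28)):
    """Weighted average of buying pressure / true range over three periods."""
    bp, tr = [], []
    for i in range(1, len(closes)):
        true_low = min(lows[i], closes[i - 1])
        true_high = max(highs[i], closes[i - 1])
        bp.append(closes[i] - true_low)  # buying pressure
        tr.append(true_high - true_low)  # true range
    longest = max(periods)
    out = [None] * len(closes)
    for i in range(longest, len(closes)):
        j = i - 1  # bp and tr start at bar 1, so they are shifted by one
        avgs = []
        for p in periods:
            tr_sum = sum(tr[j - p + 1:j + 1])
            avgs.append(sum(bp[j - p + 1:j + 1]) / tr_sum if tr_sum else 0.0)
        out[i] = 100.0 * (4 * avgs[0] + 2 * avgs[1] + avgs[2]) / 7.0
    return out


# hypothetical bars that always close at their high
highs = [1.0, 2.0, 3.0, 4.0, 5.0]
lows = [0.5, 1.5, 2.5, 3.5, 4.5]
print(ultimate_oscillator(highs, lows, highs, periods=(1, 2, 3)))
```

Bars that always close at their true high put all of the range into buying pressure, so the oscillator sits at its upper bound of 100.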
Slope measures the rise-over-run of a linear regression. In general, an uptrend is present when Slope is positive, and a downtrend exists when Slope is negative. The timeframe depends on the number of days: 10 days covers a short-term trend, 100 days a medium-term trend, and 250 days a long-term trend. As with typical trend-following indicators, Slope lags price and reverses after an actual top or bottom. This does not, however, detract from its usefulness. Trend identification and trend strength are important tools, even for traders. As with moving averages, Slope can be used with momentum indicators to participate in an ongoing trend.
We can compare the Linear Regression Slope indicator across multiple securities to determine relative strengths and weaknesses. We can also use it with other indicators to identify possible entry and exit levels, and calculate it over short, medium, and long terms to identify changes within the major trend of the security.
$$ y = a + bx $$
Where:
slope value by one hundred and then dividing the result by the price:
$$ \sum (x - \bar{x})^2 $$
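A rolling regression slope can be sketched as follows. This is a minimal illustration that fits an ordinary least-squares line to each window of closing prices against the bar index 0, 1, 2, and so on; the function names are hypothetical and no normalization by price is applied.

```python
def regression_slope(values):
    """Least-squares slope of `values` against their positions 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points")
    x_mean = (n - 1) / 2.0
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


def rolling_slope(closes, period):
    """Slope of the regression line over each trailing window of `period` bars."""
    return [
        None if i + 1 < period else regression_slope(closes[i + 1 - period:i + 1])
        for i in range(len(closes))
    ]


print(rolling_slope([1.0, 3.0, 5.0, 4.0], 3))
```

A positive value marks a rising window and a negative value a falling one, matching the uptrend and downtrend reading described above.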
Utilizing indicators to determine stocks’ price movement signals
Stock investors utilize stock indicators to make decisions and evaluate the value of stocks at the current moment. These indicators use stock prices over specific time intervals to determine whether to buy or sell stocks. They provide investors with insights into market trends and help guide investment decisions. By analyzing and interpreting these indicators, investors can gain a better understanding of stock market dynamics and make informed choices regarding their stock portfolios.
Slow period = 26 Signal = 9
Signal = Buy
Signal = Buy
S ( [ ] < [ ]) & ( [ − 1] ≤[ − 1]:
Signal = Sell
medium period = 14 slow period = 28
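The exact signal rules used in this project are not fully recoverable here, so the following is only an illustrative sketch of one common rule of this kind: a MACD line crossing its signal line. The slow period of 26 and signal period of 9 follow the values above, while the fast period of 12, the "Hold" label, and the function names are assumptions.

```python
def ema(values, period):
    """EMA with smoothing 2, seeded with the first value."""
    k = 2.0 / (period + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(out[-1] + k * (v - out[-1]))
    return out


def macd_signals(closes, fast=12, slow=26, signal=9):
    """Label each bar Buy, Sell, or Hold from MACD / signal-line crossovers."""
    macd = [f - s for f, s in zip(ema(closes, fast), ema(closes, slow))]
    sig = ema(macd, signal)
    labels = ["Hold"]  # no previous bar to compare against
    for t in range(1, len(closes)):
        if macd[t] > sig[t] and macd[t - 1] <= sig[t - 1]:
            labels.append("Buy")   # MACD crossed above its signal line
        elif macd[t] < sig[t] and macd[t - 1] >= sig[t - 1]:
            labels.append("Sell")  # MACD crossed below its signal line
        else:
            labels.append("Hold")
    return labels


prices = [10.0] * 5 + [11.0, 12.0, 13.0]  # hypothetical: flat, then rising
print(macd_signals(prices))
```

Comparing the current bar with the previous one means a signal is emitted only on the bar where the crossover happens, not on every bar where one line is above the other.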
2.2.2 LSTM
LSTM represents a powerful and widely used approach for processing sequential data. This architecture has unique mechanisms such as the forget gate, input gate, and output gate, which enable it to selectively store and discard information while maintaining important global information in the cell state. Compared to traditional recurrent neural networks (RNNs), which suffer from the vanishing gradient problem, LSTM's incorporation of memory cells and gating mechanisms has demonstrated superior performance on a range of sequential tasks. In particular, the deep structure of non-linear functions in LSTM makes it well-suited for handling time series data efficiently while using fewer computational resources [31].
Figure 1: LSTM structure [32]
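To make the role of the three gates concrete, the sketch below runs one time step of a single-unit LSTM cell with scalar input, using the standard LSTM update equations. It is an illustration only and not the model used in this project; the function names and the parameter layout are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell.

    params maps each of 'f', 'i', 'o', 'g' to a (w_x, w_h, bias) tuple.
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("f"))    # forget gate: how much of the old cell state to keep
    i = sigmoid(pre("i"))    # input gate: how much new information to write
    o = sigmoid(pre("o"))    # output gate: how much of the cell state to expose
    g = math.tanh(pre("g"))  # candidate cell value
    c = f * c_prev + i * g   # new cell state
    h = o * math.tanh(c)     # new hidden state
    return h, c


zero_params = {name: (0.0, 0.0, 0.0) for name in "fiog"}  # hypothetical weights
print(lstm_step(1.0, 0.0, 2.0, zero_params))
```

With all weights at zero every gate sits at 0.5, so the cell simply halves its previous state; trained weights move the gates toward 0 or 1 to discard or retain information.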
Increasing the number of layers and hidden units in an LSTM can improve its ability to model structured data through linear combinations. However, stock data consists of multiple dimensions, such as high, low, close, and various indicators, which are not always related. As a result, an LSTM may not be able to capture the complex patterns and relationships within the data. Additionally, the predictions of an LSTM are based on past events, which may not accurately reflect the dynamic and ever-changing nature of the stock market.
2.2.3 Support Vector Machine
Support Vector Machine (SVM) is a machine learning algorithm widely used in predicting stock trends. SVM operates on the principle of finding an optimal hyperplane that best separates different classes of data points. In the context of stock trend prediction, SVM seeks to classify stock prices into either an upward or a downward trend. By utilizing historical stock data and relevant features such as price, volume, and technical indicators, SVM analyzes patterns and establishes decision boundaries to distinguish between bullish and bearish trends. The algorithm aims to maximize the margin between the hyperplane and the closest data points, thereby improving its ability to generalize and make accurate predictions for unseen data. SVM's robustness, ability to handle high-dimensional data, and flexibility in incorporating various features make it a valuable tool for traders and investors seeking insights into stock market trends [33].
Figure 2: SVM Classifier [34]
One of the key advantages of SVM is its ability to handle high-dimensional data, making it suitable for incorporating numerous features and indicators that influence stock price movements. Moreover, SVM aims to maximize the margin between the hyperplane and the nearest data points, enhancing its ability to generalize and make accurate predictions for unseen data.
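The margin maximization mentioned above is usually written as the standard soft-margin optimization problem. The notation is an assumption of this sketch rather than the thesis's own: $x_i$ are feature vectors, $y_i \in \{-1, +1\}$ are the trend labels, $\xi_i$ are slack variables, and $C$ is the penalty parameter.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0
```

Minimizing $\lVert w \rVert$ widens the margin, whose width is $2 / \lVert w \rVert$, while $C$ controls how strongly points inside the margin or on the wrong side are penalized.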
Figure 3: Random Forest Simplified [36]
The final prediction from the Random Forest algorithm is obtained by aggregating the predictions of individual decision trees. This ensemble approach ensures improved generalization, increased stability, and reduced overfitting compared to single decision tree models.
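For classification, the aggregation step is typically a majority vote across trees. The sketch below illustrates only that step, with hypothetical per-tree predictions; it does not train any trees, and the function name is an assumption.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine per-tree label lists into one label per sample.

    tree_predictions: list with one entry per tree, each a list of labels
    (one label per sample). Ties go to the label encountered first.
    """
    n_samples = len(tree_predictions[0])
    votes = []
    for s in range(n_samples):
        counts = Counter(tree[s] for tree in tree_predictions)
        votes.append(counts.most_common(1)[0][0])
    return votes


# hypothetical predictions from three trees for two samples
trees = [["up", "down"], ["up", "up"], ["down", "down"]]
print(majority_vote(trees))
```

Because each tree sees a different bootstrap sample and feature subset, individual errors tend to disagree, and the vote averages them out.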
2.2.5 XGBoost
XGBoost (Extreme Gradient Boosting) is a powerful and widely used machine learning algorithm known for its strong performance and scalability. It belongs to the gradient boosting family, where weak prediction models, such as decision trees, are sequentially built to correct the errors made by previous models [37]. XGBoost incorporates regularization techniques, including L1 and L2 regularization, to prevent overfitting and improve model generalization. It also offers a feature importance measure, allowing users to identify the most influential variables in their datasets. Additionally, XGBoost handles missing values effectively without the need for imputation techniques. With its parallel processing capabilities, XGBoost efficiently utilizes multiple CPU cores during training, enabling faster model building. The algorithm employs tree pruning to control model complexity by removing unnecessary branches, and it supports cross-validation for assessing performance and hyperparameter tuning. Being an open-source library, XGBoost is available in various programming languages, making it accessible and widely adopted by the data science community.
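The regularization described above can be summarized by the regularized objective commonly associated with XGBoost. The notation is an assumption of this sketch: $l$ is the training loss, $f_k$ is the $k$-th tree, $T$ is the number of leaves in a tree, $w$ is the vector of leaf weights, and $\gamma$, $\lambda$, $\alpha$ are the complexity, L2, and L1 penalty coefficients.

```latex
\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert_2^2 + \alpha \lVert w \rVert_1
```

The $\gamma T$ term is what makes pruning worthwhile: a split is kept only if the loss reduction it brings exceeds the cost of the extra leaf.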
In summary, XGBoost stands out as a robust machine learning algorithm due to its performance, scalability, and advanced features. Its regularization techniques, feature importance measure, and handling of missing values contribute to model accuracy and interpretability. With parallel processing and tree pruning, XGBoost offers efficient and controlled model building. The algorithm's support for cross-validation aids in assessing performance and selecting optimal hyperparameters. Moreover, its open-source nature and availability in multiple programming languages make XGBoost accessible and widely used in both academic research and practical applications.
2.2.6 Natural Language Processing (NLP)
in the vector space. This dense representation captures semantic relationships, allowing algorithms to understand the meaning of words and infer relationships. For example, word pairs like "king" and "queen", or "man" and "woman", have similar vector representations, enabling algorithms to perform word analogy tasks. Furthermore, word embeddings capture contextual similarities by assigning similar vector representations to words that appear in similar contexts. This contextual understanding enhances the performance of algorithms in various NLP tasks.
Figure 4: The skip-gram model [39]
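The word analogy idea can be illustrated with cosine similarity on toy vectors. The three-dimensional embeddings below are made up for illustration and are not taken from any trained model; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# hypothetical toy embeddings: (royalty, male, female)
emb = {
    "king": [0.8, 0.9, 0.1],
    "queen": [0.8, 0.1, 0.9],
    "man": [0.2, 0.9, 0.1],
    "woman": [0.2, 0.1, 0.9],
}

# king - man + woman should land close to queen
analogy = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
best = max(emb, key=lambda word: cosine_similarity(analogy, emb[word]))
print(best)
```

Subtracting "man" and adding "woman" changes only the gender-related components, which is why the result points in the direction of "queen".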
Word embeddings have found extensive applications in NLP tasks such as sentiment analysis, machine translation, text classification, and information retrieval. By utilizing word embeddings, algorithms can leverage the semantic and contextual relationships between words to improve accuracy and performance. Pre-trained word embeddings like GloVe and FastText are available and provide a solid starting point for NLP tasks. These embeddings are trained on large corpora and capture general language semantics. However, it is also possible to train domain-specific word embeddings on specific datasets to capture domain-specific semantics and contextual information. This flexibility allows NLP practitioners to tailor word embeddings to the requirements of their tasks and achieve better results.
In summary, word embedding is a fundamental technique in NLP that captures semantic and contextual relationships between words by representing them as dense vectors in a continuous vector space. The ability to encode semantic and contextual information within these vector representations has transformed the field of NLP, enabling algorithms to understand and process textual data more effectively. By capturing word relationships and context, word embeddings have proven invaluable in a wide range of NLP applications, contributing to improved accuracy and performance. As the field of NLP continues to advance, word embedding techniques will play a crucial role in further enhancing the capabilities of natural language understanding and processing systems.
The Transformer architecture has emerged as a significant breakthrough in natural language processing (NLP), introducing a self-attention mechanism that captures word dependencies without traditional recurrent or convolutional structures. This approach allows the model to attend to all positions in the input sequence simultaneously, enabling efficient parallelization and effective handling of long-range dependencies [40]. As a result, the Transformer has achieved remarkable success in various NLP tasks, including machine translation, text generation, and language understanding, surpassing previous state-of-the-art results.
At the core of the Transformer architecture is the self-attention mechanism, which fundamentally changes the way models process sequential data. By employing an encoder-decoder structure with multiple layers, each comprising a self-attention module and a position-wise feed-forward neural network, the Transformer enables the model to understand both global and local dependencies within the input sequence. This comprehensive understanding, combined with the ability to perform non-linear transformations between positions, empowers the Transformer to capture intricate linguistic patterns and relationships.
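The self-attention computation can be sketched in a few lines. This is a minimal, single-head illustration of scaled dot-product attention on plain Python lists, without the learned projection matrices, masking, or multiple heads of a full Transformer; the function names are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """For each query, return the attention-weighted average of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# hypothetical toy inputs: two identical keys receive equal attention
print(scaled_dot_product_attention(
    [[1.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.0, 2.0], [4.0, 6.0]],
))
```

Each query is compared with every key in one pass, which is what lets the model relate any two positions directly regardless of their distance in the sequence.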
The parallelization-friendly design of the Transformer architecture has further contributed to its success. Unlike traditional sequential models, such as recurrent neural networks (RNNs), the Transformer can process the entire input sequence in parallel. This characteristic leverages the computational power of modern hardware, such as GPUs, leading to faster training and inference times, particularly for longer sequences. Moreover, the Transformer's capacity to learn from vast amounts of data has made it a preferred choice for NLP tasks, where large-scale datasets are often available.
In summary, the Transformer architecture has reshaped the NLP landscape by offering a powerful and efficient alternative to traditional sequence models. Its ability to capture word dependencies through self-attention, along with its parallelization-friendly design, has propelled it to achieve state-of-the-art results in various NLP tasks. The Transformer continues to drive advancements in machine translation, text generation, and language understanding, and its impact on the field is likely to endure.
Figure 5: Transformer structure [41]
BERT is an approach to pre-training language representations that achieved state-of-the-art results on eleven natural language processing tasks. BERT is based on the Transformer architecture, a neural network architecture that has been shown to be effective for sequence-to-sequence tasks. BERT is pre-trained on a massive text corpus (BooksCorpus and English Wikipedia). The pre-training process involves masking some of the tokens in the input and then predicting the missing tokens. BERT can then be fine-tuned on a variety of natural language processing tasks. Fine-tuning involves training a BERT model on a specific task, using the pre-trained BERT representations as a starting point [42].
Figure 6: Overview of BERT [43]
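The masking step of pre-training can be illustrated with a simplified sketch. It replaces a random 15% of tokens with `[MASK]` and records the originals as prediction targets; the actual BERT procedure additionally leaves some selected tokens unchanged or swaps them for random tokens, which is omitted here. The function name and the sample sentence are hypothetical.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, labels), where labels maps position -> original token."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for p in positions:
        labels[p] = masked[p]   # the model is trained to recover this token
        masked[p] = "[MASK]"
    return masked, labels


sentence = "the stock price rose after strong earnings".split()
masked, labels = mask_tokens(sentence)
print(masked, labels)
```

Because the targets come from the text itself, no manual labelling is needed for pre-training.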
BERT is a powerful language representation model that can be used for a variety of natural language processing tasks. It has been shown to be effective for tasks such as question answering, natural language inference, and sentiment analysis, which makes it a valuable tool for researchers and developers working on natural language processing.
BERT LARGE
BERT LARGE is an expanded version of the BERT model. With 24 transformer layers, 16 attention heads, and 340 million parameters, BERT LARGE offers a more comprehensive and nuanced understanding of natural language. This increased model size allows BERT LARGE to capture intricate language patterns, semantic relationships, and context at a greater depth.
Similar to its base model, BERT LARGE utilizes the pre-training and fine-tuning paradigm. During pre-training, BERT LARGE is trained on a vast text corpus, learning to predict missing tokens through the masked language model objective. This process helps BERT LARGE develop a rich representation of language, enabling it to grasp the nuances and complexities of various linguistic tasks.
After pre-training, BERT LARGE can be fine-tuned on specific natural language processing tasks. Fine-tuning involves training BERT LARGE on a task-specific dataset, utilizing the pre-trained representations as a starting point. This approach allows BERT LARGE to adapt its knowledge to the specifics of the target task, resulting in highly accurate and effective models.
BERT LARGE has achieved state-of-the-art results on a wide range of natural language processing tasks. It has demonstrated strong performance in tasks such as question answering, natural language inference, and sentiment analysis. With its extensive capacity for understanding and representing language, BERT LARGE serves as a valuable tool for researchers and developers working in the field of natural language processing.
Fin-BERT
Fin-BERT is a pre-trained NLP model for analyzing the sentiment of financial text. It is built by further training the BERT language model in the finance domain, using a large financial corpus, and then fine-tuning it for financial sentiment classification.
Fin-BERT is an open-source pre-trained Natural Language Processing (NLP) model that has been specifically trained on financial data and outperforms almost all other NLP techniques for financial sentiment analysis [44].
The main advantage of Fin-BERT is its ability to understand financial jargon, context, and nuances, which are often unique to the finance industry. It can comprehend financial terms, identify sentiment, classify financial news articles, extract financial entities, and perform various other financial text analysis tasks. This makes Fin-BERT a valuable tool for sentiment analysis, stock market prediction, risk assessment, financial news summarization, and other applications in the financial domain.
Fin-BERT has gained popularity and widespread adoption in the finance industry due to its ability to handle the complexities and challenges specific to financial text analysis. By leveraging the pre-trained BERT model and fine-tuning it on financial data, Fin-BERT offers a reliable and efficient solution for extracting insights and understanding the financial landscape from textual information.
CHAPTER 3: METHODOLOGY
This chapter describes the process of exploring the dataset and the approaches used to build our models for predicting the stock price movement of five technology companies, namely Apple, Amazon, Google, Microsoft, and Tesla, by combining financial news data and historical stock price data. The dataset consists of two main components: financial news data and historical stock price data, both collected from API services.
We use four models commonly applied to stock and time series data for exploration, and then form an ensemble model based on the model with the best result on the test set.
The ensemble model takes as its input the combination of four pre-trained models (Support Vector Machine, Random Forest, XGBoost, and LSTM), optionally with a textual embedding module from the Fin-BERT model as additional input for the final prediction.
3.1 Data Collection
The dataset used for our model consists of two main components: financial news data and historical stock price data. To gather the financial news data, APIs such as Finnhub, Benzinga, and Google News are utilized. These APIs provide access to a variety of news sources and allow us to collect news articles and related information daily.
On the one hand, the historical stock price data is obtained using the basic stock price reporting service offered by Alpha Vantage. This service allows for easy retrieval, through an API, of stock prices for the five technology companies: Apple, Amazon, Google, Microsoft, and Tesla. The historical stock prices for these companies are collected over a span of more than 10 years and stored in a CSV file. This extensive historical data is used for analysis and modeling purposes.
On the other hand, the news data is not consistently available throughout the years. In particular, our model requires a minimum of five relevant articles for each day to generate meaningful insights. Therefore, to ensure an adequate amount of news data, we narrowed the data span to a specific range from July 1, 2022, to May 1, 2023. This time frame provides a more reliable and consistent dataset for training and evaluation purposes. Hence, the news and stock prices need to be collected for the same time period. Finally, we use the sliding window technique to form time series data, split it into train, validation, and test sets, and stratify the train and validation sets to balance the classes, so that the model is not biased toward one result.
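The sliding window step can be sketched as follows. This is an illustration under stated assumptions rather than the project's actual pipeline: each sample pairs a window of consecutive daily rows with the label of the following day, the split is chronological with hypothetical 70/15/15 ratios, and the stratification step is omitted.

```python
def make_windows(rows, labels, window):
    """Pair each run of `window` consecutive rows with the label of the next day."""
    samples = []
    for end in range(window, len(rows)):
        samples.append((rows[end - window:end], labels[end]))
    return samples


def chronological_split(samples, train=0.7, val=0.15):
    """Split samples in time order into train, validation, and test sets."""
    n = len(samples)
    a = int(n * train)
    b = int(n * (train + val))
    return samples[:a], samples[a:b], samples[b:]


rows = list(range(10))             # hypothetical daily feature rows
labels = [r % 2 for r in rows]     # hypothetical up/down labels
samples = make_windows(rows, labels, window=3)
train_set, val_set, test_set = chronological_split(samples)
print(len(train_set), len(val_set), len(test_set))
```

Keeping the test set at the end of the timeline avoids evaluating on days that precede the training data.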
3.1.1 Historical stocks data
In the world of finance and investing, understanding stock data and utilizing indicators are essential for making informed decisions. Stock data provides valuable insights into the historical performance, trends, and volatility of individual stocks or broader market indices. Indicators, on the other hand, serve as quantitative tools that help interpret and analyze stock data, providing valuable signals for traders and investors.
High, Low, Close, Open, and Volume are fundamental data points used in the analysis of stock prices. Each of these data points provides specific information about a stock's trading activity during a given period, typically a day. The basic stock data includes:
1. High: The high price represents the highest price at which a stock traded during a specific period, such as a trading day. It indicates the maximum level reached by the stock's price during that period. Traders and investors use the high price to assess the stock's upward price movement and potential resistance levels.
2. Low: The low price represents the lowest price at which a stock traded during a specific period, such as a trading day. It reflects the minimum level reached by the stock's price during that period. Traders and investors use the low price to assess the stock's downward price movement and potential support levels.
3. Close: The close price represents the final price at which a stock traded at the end of a specific period, such as a trading day. It is the last recorded price before the market closes. The close price is significant as it provides insights into the stock's overall performance for the period. It is often used to calculate various technical indicators and is considered a crucial reference point for traders and investors.
4. Open: The open price represents the price at which a stock begins trading at the start of a specific period, typically a trading day. It is the first recorded price for the day. The open price can be important as it sets the initial benchmark for the stock's price movement. It helps traders and investors analyze the stock's initial market sentiment and can be used in conjunction with other data points to assess the stock's price action.
5. Volume: Volume refers to the total number of shares or contracts traded for a particular stock during a specific period, such as a trading day. It represents the level of market activity and liquidity for the stock. Volume can be a crucial indicator as it provides insights into the level of participation and interest in a stock. Higher volume often indicates increased market activity, while lower volume may suggest reduced interest or limited trading activity.
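The five data points above can be derived from the individual trades of a period, as the following sketch shows. The function name and the sample trades are hypothetical; in this project the daily values are retrieved from an API rather than computed from trades.

```python
def ohlcv_from_trades(trades):
    """Summarize one period's trades as an OHLCV record.

    trades: list of (price, size) tuples in time order.
    """
    prices = [price for price, _ in trades]
    return {
        "open": prices[0],      # first traded price of the period
        "high": max(prices),    # highest traded price
        "low": min(prices),     # lowest traded price
        "close": prices[-1],    # last traded price of the period
        "volume": sum(size for _, size in trades),
    }


day = [(10.0, 100), (10.5, 50), (9.8, 200), (10.2, 150)]  # hypothetical trades
print(ohlcv_from_trades(day))
```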
Stock data comprises historical information about a company's stock, including its price, volume, and other relevant metrics. Analyzing this data can provide valuable insights into the behavior of a stock, enabling investors to identify patterns, trends, and potential opportunities. Historical stock data allows for the calculation of various financial metrics, such as returns, volatility, and correlations. It forms the foundation for conducting in-depth analysis, developing investment strategies, and assessing the performance of portfolios.
In our project, we define two types of stock data: