Several machine learning models have been proposed for stock price forecasting. In this thesis, I propose a framework based on the Long Short-Term Memory (LSTM) and AutoRegressive Integrated Moving Average (ARIMA) models.
GENERAL INTRODUCTION
Research Problem
The stock price trend prediction problem is fairly complex, and different techniques can be applied to achieve good prediction accuracy. Prediction and analysis of the stock market are among the most complicated tasks to perform. There are several reasons for this, such as market volatility and the many other dependent and independent factors that decide the value of a particular stock in the market. These factors make it very difficult for any stock market analyst to predict rises and falls with a high degree of accuracy.
In this topic, I perform time series forecasting, which is used to solve sequence problems. Time series forecasting refers to the type of problem where I have to predict an outcome based on time-dependent inputs. A typical example of time series data is stock market data, where stock prices change over time. In technical analysis for predicting the direction of stock price movements, machine learning models can be trained to forecast stock movements or price directions through an analysis of historical data. With the advent of machine learning and its robust algorithms, recent developments in market analysis and stock market prediction have started incorporating such techniques to understand stock market data. In short, machine learning algorithms are widely used by many organizations to analyze and predict stock values. Many analysts and researchers have developed tools and techniques that predict stock price movements and help investors make proper decisions.
Researchers are now able to predict stock prices with higher accuracy thanks to analytical predictive models. These predictive techniques utilize data from previous stock price movements and look for patterns that could indicate future stock price changes in the market. Stock market prediction has entered a technologically advanced era with the advent of developments such as global digitization. The novelty of the proposed study is the development of a robust time series model based on deep learning for forecasting future values in the stock market.
Financial time series forecasting has never been an easy task because of its sensitivity to political, economic, and social factors. For this reason, people who invest in stock markets are usually looking for robust models that can help them maximize their profits and minimize their losses as much as possible. Recently, various studies have suggested that a special type of artificial neural network called the Recurrent Neural Network (RNN) could improve the accuracy of predicting the behavior of financial data over time.
In this topic, I am going to predict the closing price of a particular company using the Long Short-Term Memory (LSTM) model and the AutoRegressive Integrated Moving Average (ARIMA) model in Python.
Objectives of the Topic
The purpose of this study was to approach, analyze, and apply machine learning to closing price trend prediction for the Vietnam stock market.
Investors in the stock market seek effective methods to forecast price movements and optimize their investments. In response, researchers strive to develop innovative techniques to meet this demand, enabling investors to navigate the volatile market and achieve their financial goals.
Furthermore, I developed a Web application capable of predicting the direction in which stock market prices will move, based on trading time series as inputs.
The use of these machine learning techniques will enable investors to make better-informed decisions and invest more wisely, maximizing their returns and minimizing their losses.
Scope of the Study
In stock market prediction, analyzing past time series data is crucial for forecasting future trends. Machine learning models are employed for this purpose, with a particular focus on Recurrent Neural Networks (RNNs) and their variant, Long Short-Term Memory (LSTM). LSTM is particularly effective at handling sequential data, enabling it to capture long-term dependencies and make accurate predictions.
Besides, I propose a framework based on the Long Short-Term Memory (LSTM) and AutoRegressive Integrated Moving Average (ARIMA) models to predict the closing prices of FPT Company and some others. These predictions were made using data collected from the stock closing prices and volume values of the past ten years or more.
The forecasting models are evaluated using the Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) metrics, which were used to compute the findings of the machine learning stock prediction models.
Before I get into the program's implementation to predict stock market values, let me visualize the data on which I will be working. Here, I will be analyzing the stock value of FPT Company from the Ho Chi Minh Stock Exchange (HOSE), formerly known as the HCM Securities Trading Center, which is a stock exchange in Ho Chi Minh City, Vietnam. The stock value data is collected through "vnstock".
+ vnstock is a Python package to retrieve Vietnam stock market data from TCBS (tcbs.com.vn) and SSI (ssi.com.vn).
+ vnstock allows the user to download historical and intraday stock data and market insights from TCBS.
+ vnstock relies on public/private APIs to provide stock data.
+ After data extraction, I can save the files as Comma Separated Values (.csv) or Excel files, as sketched below.
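As an illustration of this retrieval step, the short sketch below downloads the FPT history and saves it to a .csv file. The function name stock_historical_data, its parameters, and the column name "time" are assumptions based on the vnstock documentation at the time of writing and may differ between package versions.

import pandas as pd
# NOTE: the vnstock API has changed across versions; stock_historical_data and its
# parameter names are assumed here - check the vnstock documentation for your version.
from vnstock import stock_historical_data

df = stock_historical_data(symbol="FPT", start_date="2013-01-02", end_date="2023-12-29")
df.to_csv("FPT_2013_2023.csv", index=False)                  # save the extracted data as .csv
df = pd.read_csv("FPT_2013_2023.csv", parse_dates=["time"])  # reload it later with pandas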
For example, FPT Company has its stock registered on HOSE and has its values updated on every working day of the stock market (note that the stock market does not allow trading on Saturdays, Sundays, and holidays). For each date (TradingDate), the Opening value of the stock and the Highest and Lowest values of that stock on the same day are noted, along with the Closing value at the end of the day. Additionally, the total Volume of the stock traded in the market is also given. With these data, it is up to machine learning to study the data and implement several algorithms that can extract patterns from the FPT stock's historical data.
THEORETICAL BASIS AND RELATED RESEARCH
Machine Learning Stock Market Prediction Studies: Review and Research
Stock market investment strategies are complex and rely on an evaluation of vast amounts of data. In recent years, machine learning (ML) techniques have increasingly been examined to assess whether they can improve market forecasting compared with traditional approaches. The objective of this study is to identify directions for future ML stock market prediction research based on a review of the current literature. A systematic literature review methodology is used to identify relevant peer-reviewed journal articles from the past twenty years and to categorize studies that have similar methods and contexts.
Four categories emerge: artificial neural network studies, support vector machine studies, studies using genetic algorithms combined with other techniques, and studies using hybrid or other artificial intelligence approaches. Studies in each category are reviewed to identify common findings, unique findings, limitations, and areas that need further investigation. The final section provides overall conclusions and directions for future research.
The review methodology identifies relevant peer-reviewed journal articles from the past twenty years, evaluates and categorizes studies that have similar methods and contexts, and then compares the studies in each category to identify common findings, unique findings, limitations, and areas that need further investigation. This will provide artificial intelligence and finance researchers with directions for future research into the use of ML techniques to predict stock market index values and trends.
1.2 Method For Identifying Relevant Studies
Each researcher involved in this study conducted an independent search for peer-reviewed journal articles where some form of ML was used to predict a stock market related outcome Articles were found using Google Scholar,
EBSCO, and EconLit. To identify findings that are relevant to today's IT environment, only studies from the past twenty years (1999-2019) were included in the final list. Each study used one or more ML techniques to predict stock market index values or expectations of whether the future index value will rise or fall.
Researchers meticulously analyzed each paper to identify clusters of cohesive studies employing a specific ML technique or a hybrid/multi-method approach This comprehensive evaluation led to the development of a structured taxonomy for ML stock market research, categorizing studies based on their methodological similarities.
Each of the articles fits into one of the following four categories: (1) Artificial Neural Network studies, (2) Support Vector Machine studies, (3) studies using Genetic Algorithms with other techniques, and (4) studies using Hybrid or other Artificial Intelligence approaches
Figure 2.1 Machine Learning Stock Market Prediction Study Research Taxonomy [1]
Studies Using Artificial Neural Networks to Predict Stock Market Values: The first set of articles includes studies that primarily focus on stock market prediction using artificial neural networks (ANNs)
+ Jasic and Wood (2004) developed an artificial neural network to predict daily stock market index returns using data from several global stock markets. The focus is on trying to support profitable trading.
+ Chavan and Patil (2013) contribute to our understanding of ANN stock market prediction by surveying different model input parameters found in nine published articles
+ Chong, Han, and Park (2017) analyze deep learning networks for stock market analysis and prediction
Studies Using Support Vector Machines to Analyze Stock Markets: The second group of articles includes studies primarily using support vector machines (SVMs) to make stock market predictions
+ In the context of stock market prediction, according to Schumaker and Chen (2010), SVM is a machine learning algorithm that can classify a future stock price direction (rise or drop)
+ Lee (2009) developed a prediction model based on a SVM with a hybrid feature selection method to predict the trend of stock markets
+ A unique study by Schumaker and Chen (2009) used an SVM in conjunction with textual analysis looking at the impact of news articles on stock prices
As illustrated in the first two study categories, systems primarily based on ANNs or SVMs have had some success improving stock market value prediction but, over time, there appears to be an increasing interest in trying to further improve results using multi-technique approaches
In a pioneering study by Kim and Han (2000), a genetic algorithm approach was employed to optimize feature discretization and connection weights in artificial neural networks This innovative technique aimed to enhance the accuracy of stock price index predictions.
+ Kim, Min, and Han (2006) developed a unique hybrid system using an ANN and GA to predict stock market index values
+ Kim and Shin (2007) investigate the effectiveness of a hybrid ANN and GA method for stock market prediction
ANNs, SVMs, or multi-method GA approaches are some of the most common techniques for tackling the problem of stock market prediction This final category describes studies that have used other unique, or multi-method, artificial intelligence techniques in this problem domain
+ Rule-based expert systems have been used for decades to provide domain-specific knowledge to novice decision-makers. Lee and Jo (1999) developed a candlestick chart analysis expert system for predicting the best stock market timing. The expert system includes patterns and rules that can predict future stock price movements.
+ Defined patterns are classified into five forms of price movements: falling, rising, neutral, trend-continuation, and trend-reversal patterns. The experimental results revealed that the knowledge base they developed could provide indicators to help investors get higher returns from their stock investments.
1.4 Conclusions And Future Research Directions
The objective of this study is to identify directions for future machine learning (ML) stock market prediction research based on a review of the current literature. Given the ML-related systems, problem contexts, and findings described in each selected article, several conclusions can be drawn.
There is a strong link between machine learning methods and the prediction problems they are associated with. This is analogous to task-technology fit (Goodhue and Thompson, 1995), where system performance is determined by the appropriate match between tasks and technologies:
Artificial neural networks (ANN) are best used for predicting numerical stock market index values
Support vector machines (SVM) best fit classification problems such as determining whether the overall stock market index is forecast to rise or fall
Genetic algorithms (GA) use an evolutionary problem-solving approach to identify higher quality system inputs, or predict which stocks to include in a portfolio, to produce the best returns
Stock Market Prediction Using Machine Learning Techniques: A Decade (2011 - 2021) Survey on Methodologies, Recent Developments, and Future Directions
This study explains the systematics of machine learning-based approaches for stock market prediction based on the deployment of a generic framework. Findings from the last decade (2011-2021) were critically analyzed, having been retrieved from online digital libraries and databases such as the ACM Digital Library and Scopus.
An extensive comparative analysis was carried out to identify the directions of significance. The study should help emerging researchers understand the basics and advancements of this emerging area, and thus carry on further research in a promising direction.
This article reviewed studies based on a generic framework of SMP, as presented in Figure 2.2 below. It mainly focused on studies from the last decade (2011-2021). The studies were analyzed and compared based on the type of data used as the input, the data pre-processing approaches, and the machine learning techniques used for the predictions.
Moreover, an extensive comparative analysis was performed, and it was concluded that SVM is the most popular technique used for SMP. However, techniques like ANN and DNN are increasingly used, as they provide more accurate and faster predictions. Furthermore, the inclusion of both market data and textual data from online sources improves prediction accuracy.
To gather relevant literature on stock market prediction (SMP) using machine learning, a comprehensive search was conducted across search engines, digital libraries, and databases such as Google Scholar, Research Gate, and ACM Digital Library The search query "stock market prediction using machine learning" yielded a substantial collection of academic articles, conference proceedings, and research reports.
'IEEE Xplore', 'Scopus', and so on. During the process of literature collection, various phrases such as "stock market prediction methods", "impact of sentiments on stock market prediction", and "machine learning-based approach for stock market prediction" were used as search terms.
As a result, some of the fundamental papers in the field of stock market prediction were retrieved. Careful analysis of a few basic papers provided a primary insight into the domain. The search criteria were then modified to collect the literature of the last decade, to enhance and improve coverage of the domain. The selected literature was screened by applying quality criteria, where metrics such as indexing, quartiles, impact factors, and publishers were observed.
2.3 Generic Scheme for Stock Market Prediction (SMP)
Figure 2.4 below describes the generic process involved in SMP. The process starts with the collection of the data, followed by pre-processing of that data so that it can be fed to a machine learning model.
Types of data: The prediction models generally use two types of data: market data and textual data
+ Market data are the temporal, historical, price-related numerical data of financial markets. Analysts and traders use these data to analyze historical trends and the latest stock prices in the market. They reflect the information needed to understand market behavior.
+ Textual data are used to analyze the effect of sentiments on the stock market. Public sentiment has been proven to affect the market considerably. The most challenging part is converting the textual information into numerical values so that it can be fed to a prediction model.
Data Pre-Processing: Once the data are available, they need some pre-processing so that they can be fed to a machine learning model.
Machine Learning Methods: After the data is pre-processed and transformed to a standard representation, it is fed to machine learning models for further processing
This article presented the distribution of the various machine learning techniques used in the literature so far (see Figure 2.7).
Figure 2.4 Generic Scheme for Stock Market Prediction [2]
Generally, two approaches are used for SMP: classification and regression. The former classifies the market trend as up or down; for the latter, the output is a numerical value predicting the ups and downs of the price. Figure 2.5 presents the taxonomy of the evaluation metrics used in the studies so far. It points out the different evaluation parameters used in the reviewed studies, as well as the time frame of the prediction. For the most part, the studies used accuracy as the evaluation metric, which is the percentage ratio of correct predictions over the total number of test instances.
Moreover, other metrics like Mean Square Error (MSE), the Area Under Curve (AUC), Akaike Information Criterion (AIC), R-squared (R2),
Precision, Recall, F-measure, Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) are used as well
+ The MSE measures the mean squared difference between the predicted and actual output. It is an important metric in regression analysis because it measures how close the predicted value is to the actual value.
+ The MAE measures the average absolute difference between the predicted and actual data.
+ The MAPE, used as a performance indicator in a few studies, measures the mean of the absolute error percentages of the predictions.
In terms of price predictions, the model outperformed the RNN and achieved a lower MSE and MAE compared to the constituent models
Figure 2.5 Taxonomy of the performance metrics [2]
The distribution of the number of papers published in recent years is presented in Figure 2.6. The number of publications increased from 2009 and was at its peak in 2019, but over the previous two years the publication count was lower.
The distribution of machine learning algorithms used for SMP is shown in Figure 2.7, where the SVM was the most popular technique used. However, the ANN and DNN have attracted the research community's attention over the last few years. The deep learning approaches are used to analyze complicated patterns in the stock data and provide much faster results.
The comparative analysis between the type of data used and the performance of the models is represented in Figure 2.8. Data from social media alone do not perform better than market data and technical indicators. However, if data from textual sources are combined with them, then the model performance increases.
Figure 2.6 Number of publications per year [2]
Figure 2.7 Distribution of the SMP techniques [2]
Figure 2.8 Comparison of the accuracies with different types of data [2]
Stock Price Prediction Using LSTM (Related Research)
The study of one share is carried out in this paper, and it can be extended to several shares in the future. The prediction could be more reliable if the model were trained on a greater number of datasets using higher computing capacities, an increased or decreased number of layers and LSTM modules, and hyperparameter tuning. The prediction of stock values is a complex task that needs a robust algorithmic background to compute longer-term share prices. Stock prices are correlated within the nature of the market; hence it is difficult to predict the prices. The proposed algorithm uses market data to predict the share price using machine learning techniques, namely a recurrent neural network called Long Short-Term Memory, in which the weights are corrected for each data point using stochastic gradient descent. This system will provide accurate outcomes in comparison with currently available stock price predictor algorithms.
Algorithm: Stock prediction using LSTM [3]
Output: Prediction of stock price using price variation
Step 2: Data Preprocessing after getting the historical data from the market for a particular share
Step 3: Import the dataset to the data structure and read the open price
Step 4: Do feature scaling on the data so that the data values will vary between 0 and 1
Step 5: Create a data structure with 60 timestamps and 1 output
Step 6: Building the RNN (Recurrent Neural Network) for the Step 5 data set and initializing the RNN using a sequential regressor
Step 7: Adding the first LSTM layer and some Dropout regularization for removing unwanted values
Step 8: Adding the output layer
Step 9: Compiling the RNN by adding Adam optimization and the loss as mean_squared_error
Step 10: Making the predictions and visualizing the results using plotting techniques
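As an illustration of Steps 3-6 above, the sketch below scales the open prices into [0, 1] and builds the 60-timestamp data structure. It is a minimal sketch, assuming the historical data are already loaded into a pandas DataFrame df with an "open" column; it is not the exact code of [3].

import numpy as np
from sklearn.preprocessing import MinMaxScaler

open_prices = df[["open"]].values               # Step 3: read the open price (column name assumed)
scaler = MinMaxScaler(feature_range=(0, 1))     # Step 4: scale the values into [0, 1]
scaled = scaler.fit_transform(open_prices)

X, y = [], []
for i in range(60, len(scaled)):                # Step 5: 60 timestamps and 1 output
    X.append(scaled[i - 60:i, 0])               # the previous 60 scaled prices
    y.append(scaled[i, 0])                      # the 61st value is the label
X, y = np.array(X), np.array(y)
X = X.reshape((X.shape[0], X.shape[1], 1))      # Step 6 input shape: (samples, timesteps, features)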
Pipeline and Workflow for Building Machine Learning Model
Figure 2.9 Pipeline and workflow for building machine learning model
Detail View of Machine Learning Modelling Process
Figure 2.10 Detail view of machine learning modelling process
Training, Testing, and Validation Datasets
A dataset is not used only for training purposes. A single training dataset that has already been processed is usually split into several parts, which is needed to check how well the training of the model went. For this purpose, a testing dataset is usually separated from the data. Next, a validation dataset, while not strictly crucial, is quite helpful to avoid training my algorithm on the same type of data and making biased predictions.
The training dataset used to fit the model can be further split into a training set and a validation set and it is this subset of the training dataset, called the validation set, that can be used to get an early estimate of the skill of the model
Training Dataset: The sample of data used for learning and for fitting the model. The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set and testing set.
Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.
Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset
The data that I used for this topic were downloaded from "vnstock". To evaluate the performance of the algorithm, I downloaded the actual stock prices and volume values from 1 January 2013 to 31 December 2023. I will be predicting the stock closing price trend of the FPT Corporation (HOSE: FPT), based on its stock closing prices and volume values of the past 10 years.
I tried the three methods below with my dataset and found that method 1 is suitable for my data.
Figure 2.11 Methods to split datasets: Training, Testing, and Validation
Figure 2.12 Split datasets and comparison methods [4]
I tried and chose method 1: predicting the FPT stock closing prices based on historical data of the past 10 years, from 1 January 2013 to 31 December 2023.
Full Dataset (100%): From 2013-01-02 To 2023-12-29 (2745 rows, 7 columns)
+ FullTrain_data (90%): From 2013-01-02 To 2022-11-25 (2470 rows, 7 columns)
++ Train_data (80%): From 2013-01-02 To 2021-10-21 (2195 rows, 7 columns)
++ Valid_data (10%): From 2021-10-22 To 2022-11-25 (275 rows, 7 columns)
+ Test_data (10%): From 2022-11-28 To 2023-12-29 (275 rows, 7 columns)
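A minimal sketch of how such a chronological 80/10/10 split can be produced with pandas, assuming the full dataset is loaded in a DataFrame df sorted by trading date (the variable names are illustrative):

n = len(df)                                  # 2745 rows in the full dataset
train_end = int(n * 0.8)                     # first 80% for training
valid_end = int(n * 0.9)                     # next 10% for validation
train_data = df.iloc[:train_end]             # 2013-01-02 .. 2021-10-21
valid_data = df.iloc[train_end:valid_end]    # 2021-10-22 .. 2022-11-25
test_data = df.iloc[valid_end:]              # 2022-11-28 .. 2023-12-29
fulltrain_data = df.iloc[:valid_end]         # training + validation = 90%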
1 time Trading date
2 open Opening price in session
3 high Highest price in session
4 low Lowest price in session
5 close Closing price in session
6 volume Total share volume in session
7 ticker Stock code
Long Short Term Memory (LSTM) Model
To develop a machine learning model to predict stock prices, I will be using the Long Short-Term Memory technique. By definition, Long Short-Term Memory - usually just called "LSTM" - is a special kind of Recurrent Neural Network (RNN) architecture used in deep learning that is capable of learning long-term dependencies. LSTMs were introduced by Hochreiter & Schmidhuber (1997) and were refined and popularized by many people in the following work. They work tremendously well on a large variety of problems and are now widely used.
An LSTM module (or cell) has 5 essential components that allow it to model both long-term and short-term data:
Cell state: This represents the internal memory of the cell, which stores both short-term and long-term memories.
Hidden state: This is the output state information, calculated from the current input, the previous hidden state, and the current cell input, which I eventually use to predict future stock market prices. Additionally, the hidden state can decide to retrieve only the short-term memory, only the long-term memory, or both types of memory stored in the cell state to make the next prediction.
Input gate: Decides how much information from current input flows to the cell state
Forget gate: Decides how much information from the current input and the previous cell state flows into the current cell state
Output gate: Decides how much information from the current cell state flows into the hidden state, so that, if needed, the LSTM can pick only the long-term memories, only the short-term memories, or both.
A Recurrent Neural Network (RNN) is an advanced form of neural network that has an internal memory, which makes the RNN capable of processing long sequences. This makes the RNN very suitable for stock price prediction, which involves long historical data.
An RNN can provide a considerably good prediction for temporal stock data. The hidden state and output of the RNN are given by Equations (1) and (2):
h_t = tanh(W x_t + U h_{t-1} + b)   (1)
y_t = V h_t + c   (2)
Where: x_t is the input vector at time t; b and c are bias values;
W, U, and V denote the input-to-hidden, hidden-to-hidden, and hidden-to-output weight matrices, respectively.
While working with time series data (like stock market data), an attention mechanism can be utilized to divide the given data into parts so that the decoder can use specific parts while generating new values. Figure 2.13 shows the generalized RNN architecture.
LSTM networks are modified RNNs that excel at retaining long-term dependencies due to their ability to overcome the vanishing gradient problem This architecture is ideal for tasks involving classification and time series prediction, where time lags may vary in duration LSTM consists of five primary components and employs back-propagation for model training.
Figure 2.14 A single cell LSTM architecture [11]
Cell state (c_t) - a 1D vector of fixed shape with random value initialization. It contains the information that was present in the memory after the previous time step.
Forget gate (f_t) - changes the cell state, intending to eliminate non-important values from previous time steps. This helps the LSTM network forget the irrelevant information that does not have any impact on the future price prediction.
Input gate (i_t) - changes the cell state with the aim of adding new information about the current time step. It adds new information that may affect the stock price movement.
Output gate (o_t) - decides what the next hidden state should be. The new cell state and the new hidden state are then carried over to the next time step. It returns the final relevant information, which will be used for stock price prediction.
Hidden state (h_t) - calculated by multiplying the output gate vector by (the tanh of) the cell state vector.
The values of these vectors are calculated by Equations (3)-(7):
i_t = σ(W^(i) x_t + U^(i) h_{t-1} + b^(i))   (3)
f_t = σ(W^(f) x_t + U^(f) h_{t-1} + b^(f))   (4)
o_t = σ(W^(o) x_t + U^(o) h_{t-1} + b^(o))   (5)
c_t = i_t ⊙ u_t + f_t ⊙ c_{t-1}   (6)
h_t = o_t ⊙ tanh(c_t)   (7)
Where: x_t is the input vector, c_{t-1} is the previous cell state, h_{t-1} is the previous hidden state, u_t is the candidate cell update (computed from the current input and the previous hidden state),
W and U are the input-to-hidden and hidden-to-hidden weight matrices, σ is the logistic sigmoid function, and
⊙ denotes element-wise multiplication.
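To make Equations (3)-(7) concrete, here is a minimal NumPy sketch of a single LSTM time step. The weight shapes, the random initialization, and the tanh form of the candidate update u_t are illustrative assumptions, not part of the cited formulation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following Equations (3)-(7)."""
    Wi, Ui, bi = params["i"]                       # input gate weights
    Wf, Uf, bf = params["f"]                       # forget gate weights
    Wo, Uo, bo = params["o"]                       # output gate weights
    Wu, Uu, bu = params["u"]                       # candidate (cell update) weights
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev + bi)     # Eq. (3)
    f_t = sigmoid(Wf @ x_t + Uf @ h_prev + bf)     # Eq. (4)
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev + bo)     # Eq. (5)
    u_t = np.tanh(Wu @ x_t + Uu @ h_prev + bu)     # candidate cell update (assumed tanh form)
    c_t = i_t * u_t + f_t * c_prev                 # Eq. (6), element-wise products
    h_t = o_t * np.tanh(c_t)                       # Eq. (7)
    return h_t, c_t

# Illustrative sizes: 2 input features (Close, Volume) and 4 hidden units
rng = np.random.default_rng(0)
n_in, n_hidden = 2, 4
params = {k: (rng.standard_normal((n_hidden, n_in)),
              rng.standard_normal((n_hidden, n_hidden)),
              np.zeros(n_hidden)) for k in "ifou"}
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
h, c = lstm_step(rng.standard_normal(n_in), h, c, params)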
AutoRegressive Integrated Moving Average (ARIMA) Model
• ARIMA stands for AutoRegressive Integrated Moving Average and represents a cornerstone of time series forecasting. The ARIMA model is a popular and widely used statistical method for analyzing and forecasting time series data. It has gained immense popularity due to its efficacy in handling various standard temporal structures present in time series data.
ARIMA models forecast future values by analyzing a time series' past values (lags) and forecast errors This class of models extends AutoRegressive Moving Average models by incorporating the concept of integration, making them suitable for forecasting non-stationary time series that exhibit trends or seasonality ARIMA models establish relationships between current and past values, allowing for the prediction of future data points based on historical patterns.
It combines three key components to model data [7]:
a. Autoregression (AR):
This component captures the influence of a series' past values on its future values. In simpler terms, AR considers how past observations (lags) affect the current value. It is denoted as AR(p), where p represents the number of lagged observations included in the model.
This component relates the present value to its past values through a regression equation.
b. Differencing (I for Integrated):
Stationarity is a crucial assumption for many time series analyses. Differencing involves subtracting a previous value from the current value and is often required to achieve stationarity. The degree of differencing needed is denoted by I(d).
It involves differencing the time series data to make it stationary, ensuring that the mean and variance are constant over time.
c. Moving Average (MA):
This component accounts for the effect of past forecast errors (residuals) on the current prediction. It considers past errors (lags) to improve the forecast accuracy. It is denoted by MA(q), where q represents the number of lagged errors incorporated in the model.
This component uses the dependency between an observation and a residual error from a moving average model applied to lagged observations
8.2 Components of ARIMA
a. Autoregression (AR):
The autoregressive (AR) parameter p in an ARIMA model quantifies the dependence of the current observation on its preceding values. Mathematically, an AR(p) model is expressed as:
Y_t = α + β_1 Y_{t-1} + β_2 Y_{t-2} + ... + β_p Y_{t-p} + ϵ_t
An autoregressive model is one where Y_t depends only on its own lags; that is, Y_t is a function of the lags of Y_t, as depicted by the above equation.
• Y_{t-1} is lag 1 of the series,
• β_1 is the coefficient of lag 1 that the model estimates,
• β_1 to β_p are the autoregressive coefficients,
• α is the intercept term, also estimated by the model,
• ϵ_t represents the error term at time t.
b. Integrated (I):
The differencing part of ARIMA is represented by the parameter d. It involves transforming a non-stationary time series into a stationary one by differencing consecutive observations. The differencing operation can be applied multiple times until stationarity is achieved, meaning that the statistical properties of the time series do not change over time. It helps in stabilizing the mean and removing trends from the time series. The formula for differencing is straightforward:
Y_t' = Y_t - Y_{t-1}
• Y_t' is the differenced series at time t
• Y_t is the original series at time t
• Y_{t-1} is the value of the series at the previous time step
c. Moving Average (MA):
The moving average (MA) part of an ARIMA model is represented by the parameter q, which is also known as the order of the moving average. It indicates the dependence of the current observation on the previous forecast errors. This component represents the effect of past error terms on the current value of the time series. An MA(q) model is one where Y_t depends only on the lagged forecast errors, as in the following equation:
Y_t = ϵ_t + ϕ_1 ϵ_{t-1} + ϕ_2 ϵ_{t-2} + ... + ϕ_q ϵ_{t-q}
• Y_t is the current observation, i.e. the value of the time series at time t
• ϵ_t, ϵ_{t-1}, ..., ϵ_{t-q} are the noise terms or error terms at times t, t-1, ..., t-q
• ϕ_1 to ϕ_q are the moving average parameters
The ARIMA model combines the AR, I, and MA components described above, and the general form of a non-seasonal ARIMA model is written as ARIMA(p, d, q).
An ARIMA model is one where the time series has been differenced at least once to make it stationary and the AR and MA terms are combined. So the equation of an ARIMA model (written for the differenced series Y_t) becomes:
Y_t = α + β_1 Y_{t-1} + β_2 Y_{t-2} + ... + β_p Y_{t-p} + ϕ_1 ϵ_{t-1} + ϕ_2 ϵ_{t-2} + ... + ϕ_q ϵ_{t-q} + ϵ_t
• α is a constant, or the mean of the differenced series
• β_1, β_2, ..., β_p are the autoregressive parameters representing the dependence on past values
• ϵ_t is the white noise error term at time t
• ϕ_1, ϕ_2, ..., ϕ_q are the moving average parameters representing the dependence on past forecast errors
In words: Predicted Y_t = Constant + Linear combination of the lags of Y (up to p lags) + Linear combination of the lagged forecast errors (up to q lags).
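A minimal sketch of fitting a non-seasonal ARIMA(p, d, q) with statsmodels and producing a forecast. The order (1, 1, 1), the variable names, and the 7-step horizon are illustrative assumptions, not tuned values:

from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

close = train_data["close"]                    # training closing prices (column name assumed)
print("ADF p-value:", adfuller(close)[1])      # a large p-value suggests differencing is needed (d >= 1)

model = ARIMA(close, order=(1, 1, 1))          # p=1, d=1, q=1 chosen only for illustration
result = model.fit()
forecast = result.forecast(steps=7)            # forecast the next 7 trading sessions
print(forecast)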
8.4 ARIMA Modelling Procedure (Flow Chart)
Figure 2.15 General process for forecasting using an ARIMA model [10]
9.1 Root Mean Squared Error (RMSE)
RMSE is defined as the square root of the mean squared error. The mean squared error is the average of the squared errors and evaluates the quality of a forecasting model; lower values indicate higher performance:
RMSE = sqrt( (1/n) Σ (y - ŷ)² )
• n refers to the total number of values in the test set
9.2 Mean Absolute Error (MAE)
MAE is defined as the average of the absolute differences between the predicted and actual values:
MAE = (1/n) Σ |y - ŷ|
• y refers to the actual values
• ŷ refers to the predicted values
• n represents the total number of values in the test set
Subtracting the predicted values from the actual values gives the errors; summing the absolute values of those errors and taking their mean gives the MAE. It gives a notion of the overall error of the model's predictions: the smaller, the better.
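Both metrics can be computed directly, for example with NumPy and scikit-learn; the variable names below are assumptions:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

rmse = np.sqrt(mean_squared_error(y_actual, y_predicted))   # square root of the MSE
mae = mean_absolute_error(y_actual, y_predicted)            # mean of |actual - predicted|
print(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}")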
Keras is a deep learning API written in Python, capable of running on top of the machine learning platform TensorFlow.
Figure 2.16 Keras Flow Chart for Machine Learning
Keras provides a simple workflow for training and evaluating models, as described in Figure 2.16. I create the model and train it using the training data. Once the model is trained, I use it to perform inference on the test data.
IMPLEMENTATION AND STUDY RESULTS
Introduction
Time series analysis is a technique used to analyze data trends over time A key application is predicting future values based on historical observations In this work, a time series analysis is conducted using a recurrent neural network to forecast future stock closing prices of FPT Company (FPT) The model leverages historical stock closing prices and volume data from the past decade to make these predictions.
Dataset
To evaluate the performance of the models, the data that I used for this topic were downloaded from "vnstock". I downloaded the actual FPT stock closing prices and volume values from 2013-01-02 to 2023-12-29.
The training data run from 2013-01-02 to 2021-10-21, the validation data from 2021-10-22 to 2022-11-25, and the testing data from 2022-11-28 to 2023-12-29. More details:
Full Dataset (100%): From 2013-01-02 To 2023-12-29
+ FullTrain_data (90%): From 2013-01-02 To 2022-11-25
++ Train_data (80%): From 2013-01-02 To 2021-10-21
++ Valid_data (10%): From 2021-10-22 To 2022-11-25
+ Test_data (10%): From 2022-11-28 To 2023-12-29
The dataset contains 7 columns: Time (trading date), Open (the opening price), High (the highest price), Low (the lowest price), Close (the closing price), Volume (total volume) in session, and Ticker (stock code). I see that the trend is highly non-linear, and it is very difficult to capture the trend using this information alone. This is where the power of LSTM can be utilized: LSTM is a type of recurrent neural network capable of remembering past information, and while predicting future values, it takes this past information into account.
Stock Closing Prices Trending Prediction (with LSTM model)
Stock closing price prediction is similar to any other machine learning problem where I am given a set of features and have to predict a corresponding value. I will perform the same steps as for any machine learning problem, following these steps:
3.1 Install, Import Libraries, and Import Dataset
• For this topic, the data have been downloaded from vnstock as described above. I am interested in the closing price and volume of the stock; therefore, I will filter the data from my dataset and retain only the values of the Close and Volume columns.
3.2 Data Normalization and Convert Training Data to Right Shape
• When I use a neural network, I should normalize or scale my data. I use the MinMaxScaler class from the sklearn.preprocessing library to scale my data between 0 and 1. The feature_range parameter is used to specify the range of the scaled data.
• In a time series problem, I have to predict a value at time T based on the data from days T-N, where N can be any number of steps. In this topic, I am going to predict the stock closing price based on the stock closing prices and volumes of the past 60 days. I have tried and tested different numbers and found that the best results are obtained when the past 60 days are used.
• My feature set should contain the closing stock price values and volume values of the past 60 days, while the label or dependent variable should be the stock price on the 61st day. I created two lists: feature_set and labels. There are 639 records in the training data. I execute a loop that starts from the 61st record and stores all the previous 60 records in the feature_set list; the 61st record is stored in the labels list. I need to convert both the feature_set and the labels list to numpy arrays before I can use them for training.
• To train the LSTM on my data, I need to convert my data into the shape accepted by the LSTM, i.e. a three-dimensional format. The first dimension is the number of records or rows in the dataset, which is 639 in my case. The second dimension is the number of time steps, which is 60, while the last dimension is the number of indicators. Since I am using two features, i.e. Close and Volume, the number of indicators will be 2.
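A sketch of this preprocessing with the two features (Close and Volume); the DataFrame and column names are assumptions, and the scaler is reused later to invert the predictions:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

values = train_df[["close", "volume"]].values      # keep only the Close and Volume columns
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)              # scale both features into [0, 1]

feature_set, labels = [], []
for i in range(60, len(scaled)):
    feature_set.append(scaled[i - 60:i, :])        # previous 60 days of Close and Volume
    labels.append(scaled[i, 0])                    # the 61st day's scaled closing price
feature_set = np.array(feature_set)                # 3D shape: (samples, 60, 2)
labels = np.array(labels)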
After preprocessing the data into the desired format, a sequential LSTM model is constructed with multiple layers. The model comprises four LSTM layers followed by a dense layer responsible for predicting future stock prices.
• Let's first import the classes that I am going to need to create my model: Sequential, Dense, LSTM, and Dropout. I imported the Sequential class from the keras.models library and the Dense, LSTM, and Dropout classes from the keras.layers library. As a first step, I need to instantiate the Sequential class; this will be my model class, and I will add LSTM, Dropout, and Dense layers to it.
3.4 Creating LSTM, Dropout Layers, and Dense Layer
• Add the LSTM layer to the model that I just created. To add a layer to the sequential model, the "add" method is used; inside the add method, I pass my LSTM layer. The first parameter of the LSTM layer is the number of neurons or nodes that I want in the layer. The second parameter is return_sequences, which is set to true since I will add more layers to the model. The first element of input_shape is the number of time steps, while the last element is the number of indicators. Next, a dropout layer is added to avoid over-fitting, a phenomenon where a machine learning model performs better on the training data than on the test data. Then I add three more LSTM and dropout layers to the model.
• To make my model more robust, I add a dense layer at the end of the model. The number of neurons in the dense layer is set to 1, since I want to predict a single value in the output.
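A sketch of the stacked model described above (four LSTM layers with dropout and a single-unit dense output). The layer width of 50 units and the dropout rate of 0.2 are illustrative assumptions:

from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout

model = Sequential()
# first LSTM layer: input shape = (60 time steps, 2 indicators: Close and Volume)
model.add(LSTM(units=50, return_sequences=True, input_shape=(60, 2)))
model.add(Dropout(0.2))
model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(units=50))            # last LSTM layer does not return sequences
model.add(Dropout(0.2))
model.add(Dense(units=1))            # predict a single value: the next closing price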
3.5 Model Compilation and Algorithm Training
To prepare the LSTM for training, the `compile` method is invoked on the `Sequential` model The loss function chosen is the mean squared error, and the Adam optimizer is employed for efficient loss reduction and algorithm optimization.
• To train the model that I defined in the previous few steps, I call the fit method on the model and pass it my training features and labels
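A sketch of the compile and fit calls; the number of epochs, the batch size, and the validation tensors are assumptions:

model.compile(optimizer='adam', loss='mean_squared_error')
history = model.fit(
    feature_set, labels,
    epochs=100, batch_size=32,
    validation_data=(valid_features, valid_labels),   # prepared the same way as the training windows
)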
• I have trained my LSTM; now it is time to test the performance of the algorithm on the test set by predicting the stock closing prices from 28 November 2022 to 29 December 2023. To do so, I converted my test data into the right format.
I imported my test data and removed all the columns except those that contain the stock closing prices and volume values.
3.7 Converting Test Data to Right Format
• I want my feature set to contain the stock closing prices and volume values of the previous 60 days. For 28 November 2022, I need the stock prices of the previous 60 days; to obtain them, I concatenated my training data and test data before preprocessing.
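A sketch of building the test windows and producing predictions; the DataFrame names are assumptions, and the zero padding is only there so that the two-column scaler can be inverted for the closing-price column:

import numpy as np
import pandas as pd

# take the last 60 training rows plus the test rows so every test day has 60 days of history
total = pd.concat([train_df[["close", "volume"]], test_df[["close", "volume"]]])
inputs = scaler.transform(total[len(total) - len(test_df) - 60:].values)

X_test = []
for i in range(60, len(inputs)):
    X_test.append(inputs[i - 60:i, :])
X_test = np.array(X_test)                                   # shape: (test samples, 60, 2)

scaled_pred = model.predict(X_test)                          # scaled closing-price predictions
padded = np.concatenate([scaled_pred, np.zeros_like(scaled_pred)], axis=1)
predicted_close = scaler.inverse_transform(padded)[:, 0]     # back to VND prices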
Hyperparameters Tuning
Early stopping is a kind of cross-validation strategy where I keep one part of the training set as the validation set. When I see that the performance on the validation set is getting worse, I immediately stop training the model; this is known as early stopping.
In the above images, I will stop training at the dotted line since after that my model will start overfitting on the training data
In Keras, I can apply early stopping using the EarlyStopping callback. Below is sample code for it:
from keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=5)
# the callback is then passed when fitting, e.g. model.fit(..., callbacks=[early_stop])
monitor denotes the quantity that needs to be monitored, and 'val_loss' denotes the validation error.
patience denotes the number of epochs with no further improvement after which training will be stopped. For a better understanding, let's take a look at the above image again: after the dotted line, each epoch results in a higher validation error. Therefore, 5 epochs after the dotted line (since our patience is equal to 5), the model will stop because no further improvement is seen.
Tuning the patience hyperparameter requires caution Patience defines the number of epochs allowed before early stopping triggers While a patience of 5 epochs is common, it may not always be optimal The model's performance may improve after the initial 5 epochs, with the validation error decreasing Therefore, carefully consider the patience value to ensure that early stopping does not prematurely halt the training process.
These are some results after tuning; the final image shows the best configuration.
Stock Closing Prices Trending Prediction (with ARIMA model)
Time series analysis refers to the analysis of changes in the trend of data over time and has a variety of applications; one such application is the prediction of the future value of an item based on its past values. In this section, I perform a time series analysis with the ARIMA model, predicting the future stock closing prices of FPT Company (FPT) based on its stock closing prices and volume values of the past 10 years.
To evaluate the performance of the ARIMA model, the data that I used for this topic were downloaded from vnstock. I downloaded the actual FPT stock prices and volume values from 2013-01-02 to 2023-12-29.
The training data run from 2013-01-02 to 2021-10-21, the validation data from 2021-10-22 to 2022-11-25, and the testing data from 2022-11-28 to 2023-12-29. More details:
Full Dataset (100%): From 2013-01-02 To 2023-12-29
+ FullTrain_data (90%): From 2013-01-02 To 2022-11-25
++ Train_data (80%): From 2013-01-02 To 2021-10-21
++ Valid_data (10%): From 2021-10-22 To 2022-11-25
+ Test_data (10%): From 2022-11-28 To 2023-12-29
1 time Trading date
2 open Opening price in session
3 high Highest price in session
4 low Lowest price in session
5 close Closing price in session
6 volume Total volume in session
7 ticker Stock code
Evaluation Metrics & Comparison Models
I used two metrics to evaluate the performance of the ARIMA and LSTM models: RMSE and MAE. These performance indicators are used to analyze the performance of the different stocks predicted by the LSTM and ARIMA models. Twenty listed company stocks are selected as the research objects in the experiments. Table 3.2 displays the statistical metrics for different stocks across various sectors/industries.
Table 3.2 Evaluation Metrics & Comparison Models
Table 3.2 compares the RMSE and MAE values of the LSTM and ARIMA models. Lower values in these metrics indicate better predictive performance. For most of the listed stocks, LSTM has lower RMSE and MAE values (16 stocks) compared to ARIMA (4 stocks). In some cases, LSTM has higher RMSE and MAE values than ARIMA (e.g., stock codes MWG, FRT, VIC, and VNM).
Table 3.3 Evaluation Metrics & Comparison Inputs
Table 3.3 compares the RMSE and MAE values of the same LSTM model with different inputs: predictions made using only the "Closing Price" give better results than those using both "Closing Price and Volume". Lower values in these metrics indicate better predictive performance. For most of the listed stocks, the LSTM model with only "Closing Price" as input tends to have lower RMSE and MAE values.
Conclusion and Ideas for Future Works
Based on this dataset, LSTM appears more accurate in predicting stock performance, as indicated by the generally lower RMSE and MAE values for most stocks, so I can conclude that LSTM has better performance in predicting stock prices. However, in some cases the ARIMA model may outperform LSTM in terms of RMSE and MAE (e.g., stock codes MWG, FRT, VIC, and VNM). Therefore, it can be concluded that the choice between LSTM and ARIMA may depend on the specific circumstances.
In this topic, the RMSE and MAE performance indicators are used to analyze the prediction results of different models and stocks. Both the ARIMA and LSTM models can predict stock prices, and the prediction results are generally consistent with the actual results.
Compared with ARIMA, LSTM performs better in predicting stock prices. ARIMA and LSTM also have different prediction effects on different stocks, even when the same model is used. Although ARIMA's stock price prediction performance is not as good as LSTM's, its training time is short and it has few training parameters. Besides, this topic only tests and analyzes the ARIMA and LSTM models, and finds a certain time lag between the predicted stock prices and the actual results for both models.
In comparing LSTM models predicting stock prices, using only "Closing Price" as input yields better results (lower RMSE and MAE values) than using both "Closing Price" and "Volume." This suggests that for most listed stocks, the LSTM model with "Closing Price" input has higher predictive performance.
To enhance the accuracy of stock price forecasting, future research should explore alternative models beyond ARIMA and LSTM Time lag issues in forecasting should be addressed, and novel techniques should be employed to minimize the deviation between predicted and actual data This will improve the usability and reliability of stock price forecasting models.
WEB APPLICATION
Application System Design
Features and Reports
Predictions and Compare Results of Models
Assuming that today is 2024-05-17 (Friday), I would like to predict the FPT stock closing price trend over the next 7 days. To do that, I choose to predict with the option "Next 7 Days" and view the results below.
4.1 Predicted closing price with LSTM model
Table 4.1 Results compare predicted closing price between LSTM and ARIMA model
The last trading date is 2024-05-17, with an actual price of 134,500 VND
The predicted price in the next 7 days is: 134,372 VND
The trending prediction may be a downtrend in the next 7 days
Table 4.2 Results in the predicted closing price with LSTM model
4.2 Predicted closing price with ARIMA model
The last trading date is 2024-05-17, with an actual price of 134,500 VND
The predicted price in the next 7 days is: 137,222 VND
The trending prediction may be a downtrend in the next 7 days
Table 4.3 Results in the predicted closing price with ARIMA model
4.3 Compare predicted price between LSTM and ARIMA model
Figure 4.2 Plot compare predicted price between LSTM and ARIMA model
Based on Figure 4.2, I consider that the FPT stock closing price trend prediction may be a downtrend in the next 7 days (with the last trading date being 2024-05-17).
REFERENCES
[1] T. J. Strader, J. J. Rozycki, T. H. Root, and Y.-H. Huang, "Machine learning stock market prediction studies: Review and research directions," Journal of International Technology and Information Management, vol. 28, no. 4, Article 3, 2020. Available: https://scholarworks.lib.csusb.edu/jitim/vol28/iss4/3 [Accessed: Jun. 2024].
[2] M. B. Malik, T. Arif, S. Sharma, S. Singh, S. Aich, and H.-C. Kim, "Stock market prediction using machine learning techniques: A decade survey on methodologies, recent developments, and future directions," Electronics, vol. 10, no. 21, Article 2717, Nov. 2021. Available: https://www.mdpi.com/2079-9292/10/21/2717 [Accessed: Jun. 2024].
[3] P. BS and M. Shastry PM, "Stock price prediction using LSTM," The Mattingley Publishing Co., Inc., vol. 83, pp. 5246-5251, May 2020. Available: https://www.researchgate.net/publication/348390803_Stock_Price_Prediction_Using_LSTM [Accessed: Jun. 2024].
[4] S. Prabhakaran, "Train test split - How to split data into train and test for validating machine learning models?," Machine Learning Plus. Available: https://www.machinelearningplus.com/machine-learning/train-test-split/
[5] D. Shah, H. Isah, and F. Zulkernine, "Stock market analysis: A review and taxonomy of prediction techniques," Journal of Risk and Financial Management, vol. 7, no. 2, Article 26, May 2019. Available: https://www.mdpi.com/2227-7072/7/2/26 [Accessed: Jun. 2024].
[6] X. Ji, J. Wang, and Z. Yan, "A stock price prediction method based on deep learning technology," International Journal of Computational Science and Engineering, vol. 5, no. 3, Mar. 2021. Available: https://www.emerald.com/insight/content/doi/10.1108/IJCS-05-2020-
[7] S. Prabhakaran, "ARIMA model - Complete guide to time series forecasting in Python," Machine Learning Plus, Aug. 2021. Available: https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/ [Accessed: Jun. 2024].
[8] G. Sonkavde, D. S. Dharrao, A. M. Bongale, S. T. Deokate, D. Doreswamy, and S. K. Bhat, "Forecasting stock market prices using machine learning and deep learning models: A systematic review, performance analysis and discussion of implications," Journal of Risk and Financial Management, vol. 11, no. 3, Article 94, Jul. 2023. Available: https://www.mdpi.com/2227-7072/11/3/94 [Accessed: Jun. 2024].
[9] V. Kumar, "Hands-on guide to LSTM recurrent neural network for stock market prediction," Developers Corner, Mar. 2020. Available: https://analyticsindiamag.com/hands-on-guide-to-lstm-recurrent-neural-network-for-stock-market-prediction/ [Accessed: Jun. 2024].
[10] A. L. Schaffer, T. A. Dobbins, and S.-A. Pearson, "Interrupted time series analysis using autoregressive integrated moving average (ARIMA) models: A guide for evaluating large-scale health interventions," BMC Medical Research Methodology, vol. 21, no. 1, Article 12874, Mar. 2021. Available: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-021-01235-8 [Accessed: Jun. 2024].
[11] C. Olah, "Understanding LSTM networks," Colah's blog, Aug. 2015. Available: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
[12] P. Sharma, "Stock market prediction using machine learning," Analytics Vidhya, Mar. 2024. Available: https://www.analyticsvidhya.com/blog/2021/10/machine-learning-for-stock-market-prediction-with-step-by-step-implementation/ [Accessed: Jun. 2024].
[13] P. Pathak, "Stock market price trend prediction using time series forecasting," Analytics Vidhya, Jul. 2022. Available: https://www.analyticsvidhya.com/blog/2020/11/stock-market-price-trend-prediction-using-time-series-forecasting/ [Accessed: Jun. 2024].
[14] A. Singh, "Stock prices prediction using machine learning and deep learning," Analytics Vidhya, May 2023. Available: https://www.analyticsvidhya.com/blog/2018/10/predicting-stock-price-machine-learningnd-deep-learning-techniques-python/ [Accessed: Jun. 2024].
[15] J. Brownlee, "How to create an ARIMA model for time series forecasting in Python," Machine Learning Mastery, Nov. 2023. Available: https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/ [Accessed: Jun. 2024].
[16] C. Nantasenamat, "Host your Streamlit app for free," Streamlit's Blog, Jan. 2023. Available: https://blog.streamlit.io/host-your-streamlit-app-for-free/ [Accessed: Jun. 2024].
[17] S. Datta, "How to get started with Firebase using Python," FreeCodeCamp, Feb. 2021. Available: https://www.freecodecamp.org/news/how-to-get-started-with-firebase-using-python/ [Accessed: Jun. 2024].
Tasks | Start Date | Due Date | Working Days
Overall Time | 04-Sep-2023 | 12-May-2024 | 165
I. Planning, General Introduction & Research | 04-Sep-2023 | 29-Sep-2023 | 20
Statement problem, objective, and scope of the topic | 04-Sep-2023 | 15-Sep-2023 | 10
Theoretical basis and related research, articles, posts | 18-Sep-2023 | 29-Sep-2023 | 10
II. Data collection & description, Data cleaning & normalization, Exploratory Data Analysis (EDA), Try and choose methods to split datasets: Training, Testing, and Validation datasets | 16-Oct-2023 | 27-Oct-2023 | 10
III. Implementation, Evaluation Metrics, and Comparison Models
LSTM model: Related research | 06-Nov-2023 | 10-Nov-2023 | 5
LSTM model: Code and fix bugs | 13-Nov-2023 | 24-Nov-2023 | 10
LSTM model: Hyperparameters tuning | 27-Nov-2023 | 01-Dec-2023 | 5
ARIMA model: Related research | 04-Dec-2023 | 08-Dec-2023 | 5
ARIMA model: Code and fix bugs | 11-Dec-2023 | 22-Dec-2023 | 10
ARIMA model: Parameters tuning | 25-Dec-2023 | 29-Dec-2023 | 5
Evaluation Metrics and Comparison Models | 02-Jan-2024 | 05-Jan-2024 | 4
IV. Develop a Web Application | 08-Jan-2024 | 29-Mar-2024 | 45
Related research: Streamlit and Firebase | 08-Jan-2024 | 19-Jan-2024 | 10
Front-end and Back-end development | 22-Jan-2024 | 02-Feb-2024 | 10
Front-end and Back-end development | 26-Feb-2024 | 29-Mar-2024 | 25
V. Submit Master's Thesis | 01-Apr-2024 | 12-May-2024 | 30
Review all Tasks and Codes | 01-Apr-2024 | 12-Apr-2024 | 10
Write and edit the Final Report | 15-Apr-2024 | 12-May-2024 | 20