
Graduation Thesis in Information Systems: Analyzing and forecasting the unit sales of Walmart retail goods


DOCUMENT INFORMATION

Basic information

Title: Analysis and Forecasting the Unit Sales of Walmart Retail Goods
Authors: Ta Thi Kim Binh, Trieu Kim Ngan
Supervisor: PhD. Do Trong Hop
University: University of Information Technology
Major: Information Systems
Document type: Graduation Thesis
Year: 2022
City: Ho Chi Minh City
Format
Pages: 68
Size: 22.71 MB

Structure

  • Chapter 1. Introduction
    • 1.1 Overview
    • 1.2 Related work
      • 1.2.1 Forecasting Indonesia exports using a hybrid model ARIMA-LSTM
      • 1.2.2 Time series forecasting the financial budget using ARIMA and LSTM
    • 1.3 Proposed approach
      • 1.3.1 Process overview
      • 1.3.2 Forecasting process
  • Chapter 2. Theoretical Basis
    • 2.1 Exploratory data analysis (EDA)
    • 2.2 ANOVA (Analysis of Variance) testing method
    • 2.3 Measures of forecast error: RMSE
    • 2.4 Machine learning
      • 2.4.1 ARIMA model
    • 2.5 Deep learning
      • 2.5.1 LSTM model
  • Chapter 3. Experiment and Discussion
    • 3.1.2 Approach analysis dataset
    • 3.2 Setup environment
      • 3.2.1 Information of device
    • 3.3 Data analysis results
      • 3.3.1 Units sold for each product category according to years
      • 3.3.2 Units sold for each product category according to months
      • 3.3.3 Units sold for each product category according to weekday
      • 3.3.4 Units sold for each product category according to event
      • 3.3.5 Units sold for each product category according to store
      • 3.3.6 Units sold for each product category according to state
    • 3.4 Building forecasting model results
  • Chapter 4. Conclusion and Future Work

Content

Profit prediction using ARIMA and LSTM models in time series. In addition, we presented the proposed approach (Section 1.3) using Exploratory Data Analysis (EDA) and the ANOVA statistical testing method.

Introduction

Proposed approach

We introduce the efficient and simple approach that we propose for this problem.

(1) Load the unit sales of Walmart in USA dataset, organized in the form of grouped time series.

(2) Use EDA (Exploratory Data Analysis) to analyze the data using visual techniques. EDA is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations.

(3) Use an ANOVA (Analysis of Variance) test to find out whether there exists a statistically significant difference between the mean values of more than one group.

Process overview: Dataset → EDA → ANOVA testing → Build predictive model → Result.

A collection of techniques is used in the forecasting process to estimate sales. The process starts once the goal has been determined. The outcomes of forecasting may include the sales volume in dollars as well as the number of personnel to be hired in the following year. The market factor covers elements such as a product's availability in a store, its quality, and consumer demand. A market index is an expression of a market factor as a percentage relative to a base value; industry sales increase when the market index rises. The index takes into account a wide range of market variables, including pricing, local population, and disposable personal income. The methods for prediction and data analysis are then decided upon during the forecasting process. The company might want to test the procedures if they have never been used before. The next step is data collection and analysis. A few assumptions are made regarding the anticipated sales. The sales prediction is then finalized as time goes on, and the outcomes are assessed.

After the data analysis step, we proceed to build predictive models with ARIMA (machine learning) and LSTM (deep learning). Next, we use the RMSE (root-mean-square error) metric for comparative analysis to see which model is superior. Specifically, the RMSE value measures the difference between the expected value and the actual value, and the model with the smaller error is the better one. Finally, we obtain the best model to use to predict future values.
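As a minimal sketch of this comparison step, the following Python snippet computes RMSE for two candidate forecasts and keeps the model with the smaller error; the numbers are hypothetical stand-ins, not the thesis's results:

    import numpy as np

    def rmse(actual, predicted):
        return np.sqrt(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2))

    actual = np.array([120, 135, 128, 140])       # observed unit sales (hypothetical)
    preds = {
        "ARIMA": np.array([118, 130, 131, 137]),  # hypothetical ARIMA forecasts
        "LSTM": np.array([121, 133, 127, 142]),   # hypothetical LSTM forecasts
    }
    scores = {name: rmse(actual, p) for name, p in preds.items()}
    best = min(scores, key=scores.get)            # the smaller RMSE wins
    print(scores, "->", best)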

Theoretical Basis

Exploratory data analysis (EDA)

Figure 2.1 The fundamental steps of the exploratory data analysis process: load data, feature engineering, detect outliers, detect aberrant and missing values.

Step 1: After gathering the data, we must comprehend and define its features. It is crucial in this step to identify which data features are deemed independent variables and which are dependent variables. By identifying the independent and dependent variables, we can more clearly decide which direction to take our analysis in the steps that follow.

Step 2: After defining the features, we move on to assess each feature individually. To understand the statistical aspects of each specific feature, such as its distribution, range, maximum, minimum, mean, variance, skewness, and kurtosis, we typically use graphs and basic statistical tools at this stage. Visual techniques using charts and straightforward descriptive statistics are frequently employed in this step.

Step 3: In the exploratory data analysis process, this step is the most crucial, challenging, and time-consuming. The data that is gathered typically has many features; some are regarded as independent variables, and some are thought of as dependent variables. Typically, a thorough grasp of how the independent variables affect the dependent variables is necessary for the data analysis. Furthermore, we must determine whether there is any interaction between the independent variables. There are two sub-steps in this step.
• Analysis of two variables: we examine each independent variable's impact on the dependent variable separately. Graph-based visual techniques and statistical hypothesis testing methods are frequently used in this sub-step.
• Multivariate analysis: a dependent variable is frequently affected by numerous independent variables with varying degrees of effect. We must examine these effects in order to understand the data. For example, we may need to determine which independent variable has the most influence, which has the least, which has no influence at all, and whether there is any interaction between the independent variables. Using two-dimensional charts, we can immediately see how one or more independent variables affect and interact with the dependent variable. In some circumstances, a 3-D graph will help us determine the impact of three independent variables on the dependent variable. When there are more variables, we can either print separate histograms or employ statistical techniques such as multivariate analysis of variance.

Step 4: This stage involves checking the data and finding any missing information. After data collection, it is simple to identify missing data, but we can only deal with missing data once we have some comprehension of it. For example, after understanding the distribution of the data, we can decide whether to fill in the missing data with the mean. Or, by observing the relationships between the attributes in the data, we can use an interpolation method to fill in the missing values. Anomaly detection can also be performed at this step: by directly visualizing the data, we can detect anomalies. Note that these anomalies may or may not be outliers.

We will have more information to use in deciding whether a point is an outlier once we have a better grasp of the distribution of the features and the relationships between the variables. The typical method for identifying these points is visual analysis. Then, we must recalculate using certain values to determine whether or not those points are truly anomalies. If we classify them as outliers, we can eliminate them from the data so that the remaining data set is more readily usable for the analysis, classification, and prediction processes in model building.

Step 5: This is an optional phase in the exploratory analysis workflow. Through the earlier steps, we can identify variables that need to be transformed or normalized. For instance, rather than using a variable's original value, we can use its inverse; alternatively, we can change an attribute's scale by using its log value. Data transformation improves the data's readiness for more in-depth analysis or makes it simpler to create classification and predictive models.

ANOVA (Analysis of Variance) testing method

An ANOVA test is a form of statistical analysis that checks for variance-based mean differences to see whether there is a statistically significant difference between two or more categorical groups. Another important component of ANOVA is that the independent variable is divided into two or more groups. For instance, one or more groups might be predicted to have an impact on the dependent variable, whereas another group might be used as a control group and not be predicted to have an impact.
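As a minimal sketch, a one-way ANOVA can be run in Python with scipy; the three groups below are hypothetical samples, not the thesis's data:

    import numpy as np
    from scipy import stats

    # Hypothetical unit-sales samples for three groups (e.g., three stores).
    rng = np.random.default_rng(0)
    group_a = rng.normal(100, 10, 50)
    group_b = rng.normal(105, 10, 50)
    group_c = rng.normal(98, 10, 50)

    # One-way ANOVA; the null hypothesis H0 is that all group means are equal.
    f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
    print(f"F = {f_stat:.3f}, p = {p_value:.4f}")  # p < 0.05: at least one mean differs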

Measures of forecast error: RMSE

Root-mean-square error (RMSE) is one of the most popular measures for estimating the accuracy of a forecasting model's predicted values versus the actual or observed values when training regression or time series models. It is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far data points are from the regression line; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit. Root-mean-square error is commonly used in climatology, forecasting, and regression analysis to verify experimental results.

Figure 2.2 Residuals on a scatter plot (statisticshowto.com/rmse-root-mean-square-error).

Formula:

RMSE = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2}

where m is the number of observations (rows), y_i is the observed value, and \hat{y}_i is the predicted value.
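A minimal sketch of this formula in Python; the sample numbers are illustrative only:

    import numpy as np

    def rmse(actual, predicted):
        """Root-mean-square error between observed and predicted values."""
        actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
        return np.sqrt(np.mean((actual - predicted) ** 2))

    print(rmse([3, 5, 2.5, 7], [2.5, 5, 4, 8]))  # ≈ 0.935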

Machine learning

Machine learning is a branch of artificial intelligence (AI) and computer science that focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy. Machine learning-based products like Netflix's recommendation engine and self-driving cars have been made possible in recent years by technical advancements in storage and processing capability. The rapidly expanding discipline of data science includes machine learning as a key element. Through the use of statistical approaches, algorithms are trained to create classifications or predictions and to find significant insights in data mining projects. The decisions made as a result of these insights ideally influence key growth indicators in applications and enterprises. Data scientists will be more in demand as big data continues to develop and flourish; they will be expected to help determine the most pertinent business questions and the information needed to address them.

The learning system of a machine learning algorithm can be divided into three main parts.

1. A decision-making process: machine learning algorithms are typically used to produce a forecast or a classification. Based on some input data, which may be labeled or unlabeled, the algorithm generates an estimate of a pattern in the data.

2. An error function: an error function rates the model's forecast. If there are known examples, the error function can compare against them to determine how accurate the model is.

3. A model optimization process: if the model can more closely match the training set's data points, weights are adjusted to lessen the difference between the known example and the model prediction. The algorithm repeats this "evaluate and optimize" cycle, updating weights automatically, until a predetermined accuracy threshold has been reached.

Machine learning models fall into four primary categories.

1. Supervised machine learning: supervised learning refers to the process of teaching algorithms to correctly classify data or predict outcomes using labeled datasets. As input data is fed into the model, the model adjusts its weights until it is well fitted. This happens as part of the cross-validation process, to make sure the model does not fit too well or too poorly. A common example of how supervised learning aids companies is classifying spam into a distinct folder from your email. Neural networks, naive Bayes, linear regression, logistic regression, random forest, and support vector machines (SVM) are a few techniques used in supervised learning.

2. Unsupervised machine learning: unsupervised learning analyzes and groups unlabeled datasets using machine learning algorithms. These algorithms identify hidden patterns or data clusters without the assistance of a human. Because it can identify similarities and differences in data, this approach is suitable for exploratory data analysis, cross-selling tactics, customer segmentation, and image and pattern recognition. Additionally, dimensionality reduction is used to lower the number of features in a model; two popular methods for this are singular value decomposition (SVD) and principal component analysis (PCA). Neural networks, k-means clustering, and probabilistic clustering techniques are other algorithms used in unsupervised learning.

3. Semi-supervised learning: offers a happy medium between supervised and unsupervised learning. During training, it uses a smaller labeled data set to guide classification and feature extraction from a larger, unlabeled data set. Semi-supervised learning can solve the problem of not having enough labeled data for a supervised learning algorithm. It also helps when it is too costly to label enough data.

4. Reinforcement learning: unlike supervised learning, the machine is not handed correct answers to learn from; a list of permissible actions, rules, and possible end states is provided rather than an answer key. Machines can learn by doing when the algorithm's desired goal is fixed or binary; however, when the desired goal is variable, the system must learn through experience and reward. In reinforcement learning models, the "reward" takes the form of a numerical value that is programmed into the algorithm as something the system aims to acquire.

How machine learning works

Machine learning comprises different types of models, using various algorithmic techniques. Depending on the type of data and the desired result, one of four learning models can be used: supervised, unsupervised, semi-supervised, or reinforcement. Within each of those models, one or more algorithmic strategies may be applied, depending on the data sets in use and the intended outcomes. In essence, machine learning algorithms are designed to categorize objects, find patterns, forecast outcomes, and make decisions. When dealing with complicated and more unpredictable data, it is possible to employ one algorithm at a time or to combine several algorithms to achieve the highest level of accuracy.

Figure 2.3 The machine learning process: input data → develop model → train model → test and analyze → model goes live (sap.com/insights/what-is-machine-learning).

Common machine learning algorithms

A number of machine learning algorithms are commonly used, including:

1. Neural networks: neural networks, which include a vast number of connected processing nodes, mimic how the human brain functions. Their aptitude for pattern detection benefits applications such as natural language translation, image recognition, speech recognition, and image generation.

2. Linear regression: based on a linear relationship between various values, this technique is used to forecast numerical values. The method might be applied, for instance, to forecast housing prices based on local historical data.

3. Logistic regression: this supervised learning method predicts categorical response variables, such as "yes/no" answers to questions. Applications include sorting spam and performing quality control on a production line.

4. Clustering: clustering algorithms use unsupervised learning to find patterns in data and group it. Computers can help data scientists by spotting distinctions between data points that humans have missed.

5. Decision trees: decision trees can be used to classify data into categories as well as to forecast numerical values (regression). The branching sequence of linked decisions in a decision tree can be shown with a tree diagram. In contrast to the neural network's "black box," decision trees are simple to validate and audit, which is one of their benefits.

6. Random forests: in a random forest, the machine learning algorithm predicts a value or category by combining the results from a number of decision trees.

2.4.1 ARIMA model

The ARIMA model is created using the previously outlined procedure. The ARIMA model's flow is depicted in Figure 2.4. The series is examined to see whether it is stationary, and if not, various transformations are applied to make it stationary. The values of the terms p, d, and q (the model parameters) are then determined after plotting the autocorrelation function (ACF) and partial ACF (PACF) graphs. The ACF retrieves autocorrelation values from a series that also includes lagged values. Graphing these values with the confidence band produces the ACF plot, which illustrates the strength of the association between the current value of the series and its prior values. The ACF determines the correlation based on a time series' trend, seasonality, cyclicity, and residual factors. Instead of relating current and lagged values as the ACF does, the PACF retrieves the correlation of the residuals with the next lagged value. Additionally, the model fit is carried out in three stages: first the AR model, then the MA model, and finally their combination to produce ARIMA. The model is used to create predictions on the validation data. The model's error and accuracy are then examined and assessed.
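A minimal sketch of this flow with statsmodels; the series below is a synthetic stand-in for the real data, and the order (p, d, q) = (1, 1, 1) is illustrative only:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
    from statsmodels.tsa.arima.model import ARIMA

    # Hypothetical daily unit-sales series (synthetic stand-in for the real data).
    idx = pd.date_range("2011-01-29", periods=500, freq="D")
    rng = np.random.default_rng(0)
    sales = pd.Series(100 + np.cumsum(rng.normal(0, 2, 500)), index=idx)

    diffed = sales.diff().dropna()           # difference once toward stationarity
    plot_acf(diffed)                         # ACF suggests the MA order q
    plot_pacf(diffed)                        # PACF suggests the AR order p
    plt.show()

    fitted = ARIMA(sales, order=(1, 1, 1)).fit()   # (p, d, q) chosen from the plots
    forecast = fitted.forecast(steps=28)           # predict the next 28 days
    print(forecast.tail())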

Figure 2.4 Flow chart of the ARIMA model: (1) plot the data and identify unusual observations; (2)-(3) if necessary, difference the data until it appears stationary, using unit-root tests; (4) plot the ACF/PACF of the differenced data and try to determine possible candidate models; (5) try the chosen model and use the AIC to search for a better model; (6) check the residuals of the chosen model by plotting their ACF and running a test on them (source: Forecasting Indonesia Exports using a Hybrid Model ARIMA-LSTM).

1) Training and Validation - ARIMA Model

The dataframe has been split 4:1 into training and validation datasets (80% train, 20% validation). After the model was constructed using the training data, the validation data was used in prediction to assess the model's accuracy. Statistical operations have been carried out on data subsets using window functions. Rolling functions calculate fresh values for each row in the dataframe: a desired calculation is performed on the rows in a window, which is a subset of the dataset, and a minimum window size can be defined. Calculating the rolling mean (moving average) for each row in the given window produces an updated average value; the rolling standard deviation is computed similarly.

A 24-hour window has been selected. The plots show the calculated rolling mean and rolling standard deviation for a window size of 24.
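A minimal sketch of the split and the rolling statistics with pandas; the series is a synthetic stand-in:

    import numpy as np
    import pandas as pd

    # Hypothetical hourly unit-sales series (synthetic stand-in).
    idx = pd.date_range("2016-01-01", periods=1000, freq="h")
    sales = pd.Series(np.random.default_rng(1).poisson(50, 1000), index=idx, dtype=float)

    split = int(len(sales) * 0.8)                 # 4:1 split: 80% train, 20% validation
    train, valid = sales.iloc[:split], sales.iloc[split:]

    rolling = train.rolling(window=24, min_periods=1)
    print(rolling.mean().tail())                  # window rolling mean (moving average)
    print(rolling.std().tail())                   # window rolling standard deviation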

2) Tests for Stationarity - ADF and KPSS Tests
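The augmented Dickey-Fuller (ADF) test takes a unit root (non-stationarity) as its null hypothesis, while the KPSS test takes stationarity as its null, so the two complement each other. A minimal sketch with statsmodels, run on a hypothetical series:

    import numpy as np
    from statsmodels.tsa.stattools import adfuller, kpss

    series = np.cumsum(np.random.default_rng(2).normal(0, 1, 500))  # hypothetical series

    adf_p = adfuller(series)[1]
    print(f"ADF p-value: {adf_p:.4f}")    # small p rejects the unit-root null (stationary)

    kpss_p = kpss(series, regression="c")[1]
    print(f"KPSS p-value: {kpss_p:.4f}")  # small p rejects the null of stationarity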

Deep learning

Deep learning is a subset of machine learning; it is essentially a neural network with three or more layers. These neural networks make an effort to mimic how the human brain functions (although they fall far short of matching it), enabling the system to "learn" from vast volumes of data. While a neural network with only one hidden layer can still make approximate predictions, additional hidden layers can help to tune and refine for accuracy. Many artificial intelligence (AI) applications and services are powered by deep learning, which enhances automation by carrying out mental and physical tasks without the need for human intervention. Deep learning is the technology that powers both established and emerging technologies, such as voice-activated TV remote controls, digital assistants, credit card fraud detection, and self-driving cars.

How does deep learning work

The structure of a deep neural network: we must understand every component of a neural network in order to understand how to construct a deep one. Each layer in a neural network is made up of nodes. There is an input layer, which may include the entire image or just a portion of it. Numerous hidden layers then perform the function of extracting image features. The output layer completes the process by providing the network's intended response.

Figure 2.5 A deep neural network consists of an input layer, multiple hidden layers, and an output layer, all made up of nodes (quantib.com/blog/how-does-deep-learning-work-in-radiology).

A node represents a simple calculation with an input and an output. The outputs of the nodes in the previous layer serve as the inputs (hence the connecting lines between nodes). The output is calculated within the node and then forwarded to the following layer. During training, determining the best output for each node in your deep neural network ensures that the network responds correctly when all nodes are joined.

What happens in a neural node?

Figure 2.6 A node in a hidden layer of a deep neural network takes an input, performs a calculation, and passes an output to the nodes in the next layer (quantib.com/blog/how-does-deep-learning-work-in-radiology).

The calculation consists of two steps: (1) a straightforward multiplication of all the inputs by their corresponding "weights," whose results are summed, and (2) the activation function. The activation function is a little trickier; it enables the network to carry out more complex calculations and hence address more difficult problems. After these two steps, the output of the node is transmitted to several nodes in the following layer (to which it becomes an input). In a typical neural network's hidden layers, every node in the first hidden layer is connected to every node in the second, which is connected to every node in the third hidden layer, and so on.
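A minimal sketch of the calculation inside a single node; the weights and inputs are hypothetical, and the sigmoid is just one example of an activation function:

    import numpy as np

    def node_output(inputs, weights, bias):
        """Step 1: weighted sum of inputs; step 2: sigmoid activation (illustrative)."""
        z = np.dot(inputs, weights) + bias
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, 0.1, 0.9])       # outputs from nodes in the previous layer
    w = np.array([0.4, -0.2, 0.7])      # one weight per incoming connection (hypothetical)
    print(node_output(x, w, bias=0.1))  # the value passed on to the next layer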


Figure 2.7 Each neural node in a deep neural network performs a calculation by taking inputs from the nodes in the previous layer.

Three steps to creating a neural network

As we saw in the previous section, the weights of a neural network directly affect how input data is converted and transmitted from one node to the next. Thus, the weights ultimately determine the output's value. Training is the process of identifying the weight values that produce the desired results. See Figure 2.8 to get a sense of how many weights we are dealing with.

How many weights are there in a neural network? For example, an input layer of 1,000 nodes fully connected to a 12-node hidden layer already yields 12,000 weights.

Figure 2.8 Total weights in a neural network (quantib.com/blog/how-does-deep-learning-work-in-radiology).

Step 1: Initial training of the network using the training set

The first step is training the network, i.e., taking a first pass at determining the value of each weight. This is done using backpropagation. Assigning all weights random values is the first step in training the network via backpropagation. The network is then applied to each image in the training set.

The network's output, which at this point will be entirely random, is then compared with the ground truth. The error can be defined as the discrepancy between the output and the actual data. For instance, suppose our network is expected to determine the likelihood that a tumor is present in an image. The network responds with "32% probability of a tumor" when we insert an image with a tumor. Comparing this response to the ground truth, which in this case is "100% probability of a tumor," gives an error of 68%, which shows how far off the algorithm is.


Figure 2.9 Calculation of the error for one image.

To obtain the network's overall error over the training set, the errors of each example in the training set are merged; this is nothing more than adding up the errors for all the images in the training set. We must reduce this overall error if we want our network to operate at its best. Backpropagation is calculated at this stage.

To do this, all weights must be adjusted, starting at the output layer and moving backward to the input layer. Gradient descent is a commonly used technique to iteratively alter the weights, gradually approaching the right values.
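A minimal sketch of the gradient descent idea on a single weight; the toy loss function and learning rate are illustrative, not the actual training setup:

    # Toy loss for one weight: loss(w) = (w - 3)**2, whose gradient is 2*(w - 3).
    def grad(w):
        return 2 * (w - 3)

    w = 0.0          # start from an arbitrary initial weight
    lr = 0.1         # learning rate (illustrative)
    for _ in range(50):
        w -= lr * grad(w)   # step opposite the gradient direction
    print(w)         # converges toward the minimum at w = 3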

Step 2: Refining the network using the validation set

When used on the training set, the training technique described above will produce a neural network with extremely little error. Roughly speaking, we can say that the deeper a network is, that is, the more hidden layers and nodes it has, the more effectively it learns from the training set. However, it is possible that the network overfits, i.e., learns excessively from the provided samples. As a result, the network performs admirably on the training set but poorly on fresh data; in other words, the performance of the network cannot be extrapolated. There are various methods for countering this (see the sketch after this list). For example, you could:
• Use more data. This is easier said than done, because usually if you had more data at the start, you would have used it already.
• Apply augmentation. If there is not enough original data available, it can be created artificially: take the original dataset and transform the images so that they are different but still resemble a credible example. This transformation may be cropping, translating, rotating, resizing, stretching, or randomly deforming them.
• Apply dropout. This method is applied during training by ignoring random neurons in each iteration, i.e., "turning them off." Other neurons will then pick up the task of the ignored neurons. Hence, instead of the network becoming too dependent on specific weights between specific neurons, the interdependencies distribute themselves over the whole network.
• Apply regularization to the weights. By introducing extra conditions or controls on the weights, or rather on the total of all weights, you can steer the network away from overfitting.
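As a minimal sketch of the last two remedies, assuming the Keras API; the layer sizes and rates are illustrative, not the thesis's configuration:

    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    # Layer sizes and rates are illustrative, not an actual tuned configuration.
    model = keras.Sequential([
        layers.Dense(64, activation="relu", input_shape=(10,),
                     kernel_regularizer=regularizers.l2(0.01)),  # weight regularization
        layers.Dropout(0.5),     # randomly "turn off" half the neurons each iteration
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.summary()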

Figure 2.10 Four ways to improve the neural network in the validation phase: more data, augmentation, dropout, regularization (quantib.com/blog/how-does-deep-learning-work-in-radiology).

Step 3: Using the test set

Testing, the final step, makes use of a set of data that has been kept separate. This stage is relatively simple: once the system processes the test set's images, performance indicators such as accuracy, recall, or DICE score are determined. Consequently, this is an impartial evaluation of the algorithm's performance on brand-new data. There is no turning back after the algorithm's performance has been evaluated on this test set; after all, modifying the algorithm in response to the test results would make the test set no longer independent of the network training.

AI vs Machine Learning vs Deep Learning

Machine learning and deep learning fall within the larger category of artificial intelligence. Furthermore, deep learning is a subset of machine learning, as the diagram shows: deep learning sits inside machine learning, which in turn sits inside AI. Let us now look more closely at how they differ from one another.

Figure 2.11 Comparison of AI vs machine learning vs deep learning: AI is a technique that enables machines to mimic human behaviour; machine learning is a subset of AI that uses statistical methods to enable machines to improve with experience; deep learning is a subset of ML that makes the computation of multi-layer neural networks feasible (source: ResearchGate).

2.5.1 LSTM model

The three-lane activity swimlane diagram of the LSTM model is shown in Figure 2.12. Data input is shown in the first lane, followed by data cleaning, feature extraction, EDA, and MinMax scaling to fit in the range (0, 1). The second lane shows how an LSTM model with one LSTM layer and one Dense layer is implemented. The third lane shows predictions made using the fitted model and evaluated on validation data, along with predictions for the following five years.
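A minimal sketch of this setup, assuming the Keras API; the synthetic series, window length, and number of LSTM units are illustrative, not the thesis's exact configuration:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    from tensorflow import keras
    from tensorflow.keras import layers

    # Hypothetical daily unit-sales values (synthetic stand-in for the real data).
    values = 100 + np.cumsum(np.random.default_rng(3).normal(0, 2, 600))

    scaler = MinMaxScaler(feature_range=(0, 1))          # MinMax scaling into (0, 1)
    scaled = scaler.fit_transform(values.reshape(-1, 1)).ravel()

    lookback = 28                                        # window length (illustrative)
    X = np.array([scaled[i:i + lookback] for i in range(len(scaled) - lookback)])
    y = scaled[lookback:]
    X = X.reshape(-1, lookback, 1)                       # (samples, timesteps, features)

    model = keras.Sequential([
        layers.LSTM(50, input_shape=(lookback, 1)),      # one LSTM layer
        layers.Dense(1),                                 # one Dense output layer
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2, verbose=0)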

Figure 2.12 Swimlane diagram of the LSTM model (source: A deep learning based traffic crash severity prediction framework).

Experiment and Discussion

Approach analysis dataset

Figure 3.2 Independent variables and dependent variables

Based on the dataset, we determined the dependent variable (unit_sold) and the independent variables (weekday, date, month, year, store, event, state). We then analyzed the influence of the independent variables on the quantity of products sold, and used ANOVA statistics to determine whether the effects are statistically significant.

Assumptions of the ANOVA test: one null hypothesis H0 is tested (H0: the variable has no influence on units sold).
• p-value < 0.05: reject the null hypothesis; the variable is statistically significant.
• p-value > 0.05: fail to reject the null hypothesis; the variable is not statistically significant.
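A minimal sketch of such a test with statsmodels; the dataframe below is hypothetical, with column names mirroring the thesis's variables:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Hypothetical dataframe mirroring the thesis's variables.
    rng = np.random.default_rng(5)
    df = pd.DataFrame({
        "year": rng.choice([2011, 2012, 2013, 2014, 2015], size=300),
        "unit_sold": rng.poisson(100, size=300).astype(float),
    })

    model = smf.ols("unit_sold ~ C(year)", data=df).fit()
    table = sm.stats.anova_lm(model, typ=2)   # columns: sum_sq, df, F, PR(>F)
    print(table)
    # If PR(>F) for C(year) is below 0.05, reject H0: year affects unit sales.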

Setup environment

3.2.1 Information of device
• Device name: LAPTOP-CA0S70MK
• Manufacturer: ASUSTeK COMPUTER INC.
• Processor: 11th Gen Intel® Core™ i5-1135G7 @ 2.40GHz 2.42GHz
• Installed RAM: 8.00 GB (7.70 GB usable)
• OS: Windows 11 Home Single Language
• Version: 21H2
• System type: 64-bit OS, x64-based processor

Anaconda offers the easiest way to perform Python/R data science and machine learning on a single machine, giving access to thousands of open-source packages and libraries. For Windows, we used the 64-bit graphical installer with Python 3.9 (621 MB). The Jupyter Notebook is the original web application for creating and sharing computational documents; it offers a simple, streamlined, document-centric experience.

Time series analysis and time series forecasting are common data analysis tasks that can help organizations with capacity planning, goal setting, and anomaly detection. Advanced modeling approaches, which were previously exclusively available to those with advanced degrees in statistics, are now accessible to people with only rudimentary programming abilities thanks to a growing variety of freely available tools. To begin with, a dataset is loaded into a data frame in a Python Jupyter notebook, followed by some data manipulation to get it ready for analysis, some charting, and ultimately a forecast built from the data.

An ordered set of observations is known as a time series; each observation is made at a certain point in time. Time series data appear in many different fields: we can expect to see a time series in any area where measurements are taken over time. In Python it is very popular to use the pandas package to work with time series. It offers a powerful suite of optimized tools that can produce useful analyses in just a few lines of code. A pandas.DataFrame object can contain several quantities, each of which can be extracted as an individual pandas.Series object, and these objects have a number of useful methods specifically for working with time series data.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration. An extensive list of result statistics is available for each estimator. The results are tested against existing statistical packages to ensure that they are correct. The package is released under the open-source Modified BSD (3-clause) license. statsmodels supports specifying models using R-style formulas and pandas DataFrames.

In [1]: import numpy as np

In [2]: import statsmodels.api as sm

In [3]: import statsmodels.formula.api as smf

3.3 Data analysis results

3.3.1 Units sold for each product category according to years

Based on Figure 3.2, we identified the independent variables (weekday, date, month, year, store, event, state) and the dependent variable (unit sold).

We will execute the following:
• Step 1: conduct an exploratory data analysis process based on the variables, using visualization models.
• Step 2: use ANOVA statistics to determine whether the effects are statistically significant.

Figure 3.3 The visualization of units sold for each product category according to year

Based on the figure, we find that units sold of the Hobbies category are the lowest and units sold of the Foods category are the highest over the five years. For the Hobbies category, unit sales grew slowly and bottomed out from 2011 to the middle of 2013, with slight growth from the middle of 2013 onwards. The Foods category is considered the category with the highest consumption and is the mainstay of Walmart because of its highest unit sales compared to the other categories; it shows outstanding unit sales at the beginning of 2016.

We test the null hypothesis H0: "the time independent variable (year) has no influence on the units sold dependent variable."

53.66027117731279  3.921822380241903e-10

                 sum_sq  df          F        PR(>F)
C(catid)   7.061609e+07  20  45.355402  4.363039e-07

Here are the ANOVA results: using this test, a p-value of 4.363039e-07 was obtained, which is smaller than 0.05. Therefore, the null hypothesis is rejected, so the effect of the time independent variable (year) on the units sold dependent variable is statistically significant.

3.3.2 Units sold for each product category according to months

Figure 3.4 The visualization of units sold for each product category according to month

Based on the figure, we find that units sold of the Hobbies category are the lowest and units sold of the Foods category are the highest over the five years. For the Hobbies category, unit sales grew slowly from a low base, with slight growth in the first and last months of the year. The Foods category is considered the category with the highest consumption and is the mainstay of Walmart because of its highest unit sales compared to the other categories; it shows outstanding unit sales in the first months, and is highest in March. To sum up, products are consumed the most in the early months of the year (March, April) and the least in the middle months of the year (May, June, July).

We test the null hypothesis H0: "the time independent variable (month) has no influence on the units sold dependent variable."

Here are the ANOVA results: a p-value of 3.154686e-17 was obtained, which is smaller than 0.05. Therefore, the null hypothesis is rejected, so the effect of the time independent variable (month) on the units sold dependent variable is statistically significant.

3.3.3 Units sold for each product category according to weekday

Figure 3.5 The visualization of units sold for each product category according to weekday (total sales by weekday)

Based on the figure, we find that units sold of the Hobbies category are the lowest and units sold of the Foods category are the highest over the five years. For the Hobbies category, unit sales grew slowly from a low base, with slight growth at the weekend. The Foods category is considered the category with the highest consumption and is the mainstay of Walmart because of its highest unit sales compared to the other categories; it shows outstanding unit sales at the weekend. To sum up, products are consumed the most at the weekend (Saturday, Sunday) and the least midweek (Tuesday, Thursday).

We test the null hypothesis H0: "the time independent variable (weekday) has no influence on the units sold dependent variable." (ANOVA output columns: df, sum_sq, mean_sq, F, PR(>F).)

Here are the ANOVA results: a p-value of 0.0 was obtained, which is smaller than 0.05. Therefore, the null hypothesis is rejected, so the effect of the time independent variable (weekday) on the units sold dependent variable is statistically significant.

3.3.4 Units sold for each product category according to event

Figure 3.6 The visualization of units sold for each product category according to event

Based on the figure, we find that units sold of the Hobbies category are the lowest and units sold of the Foods category are the highest over the five years. For the Hobbies category, unit sales grew slowly from a low base, with slight growth at National and Religious events. The Foods category is considered the category with the highest consumption and is the mainstay of Walmart because of its highest unit sales compared to the other categories; it shows outstanding unit sales at National and Religious events. To sum up, products are consumed the most at National and Religious events and the least at Sporting events.

We test the null hypothesis H0: "the time independent variable (event) has no influence on the units sold dependent variable."

Here are the ANOVA results: a p-value of 0.039071 was obtained, which is smaller than 0.05. Therefore, the null hypothesis is rejected, so the effect of the time independent variable (event) on the units sold dependent variable is statistically significant.

3.3.5 Units sold for each product category according to store

Figure 3.7 The visualization of units sold for each product category in each store (total sales by category in each store)

Based on the figure, we find that units sold of the Hobbies category are the lowest and units sold of the Foods category are the highest over the five years. For the Hobbies category, unit sales grew slowly from a low base, with slight growth at the CA_1, CA_2, and CA_3 stores in California. The Foods category is considered the category with the highest consumption and is the mainstay of Walmart because of its highest unit sales compared to the other categories; it shows outstanding unit sales at the CA_3 store in California. To sum up, products are consumed the most at stores in California and the least in Wisconsin.

We test the null hypothesis H0: "the time independent variable (store) has no influence on the units sold dependent variable." (ANOVA output columns: df, sum_sq, mean_sq, F, PR(>F).)

Here are the ANOVA results: a p-value of 0.888018 was obtained, which is larger than 0.05. Therefore, we fail to reject the null hypothesis, so the effect of the independent variable (store) on the units sold dependent variable is not statistically significant.

3.3.6 Units sold for each product category according to state

Figure 3.8 The visualization of units sold for each product category in each state (total sales by category in each state)

Based on the figure, we find that units sold of the Hobbies category are the lowest and units sold of the Foods category are the highest over the five years. For the Hobbies category, unit sales grew slowly from a low base, with slight growth in California. The Foods category is considered the category with the highest consumption and is the mainstay of Walmart because of its highest unit sales compared to the other categories.

Building forecasting model results

We used the ARIMA and LSTM algorithms to forecast the evolution of units sold.

In this section, we assess the two models created to forecast changes in the number of units sold.

This study compares the effectiveness of two time series prediction algorithms. The objective is to predict the number of units sold over the next 28 days. After creating the stationary series, we applied the ARIMA model with several settings; the random walk ARIMA was the best model to keep. Then, using various parameter values, we created an LSTM architecture; the ideal arrangement had two LSTM blocks in the hidden layer.
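A minimal sketch of the 28-day forecast with a random-walk ARIMA, i.e., order (0, 1, 0), matching the "random walk" description above; the training series is a synthetic stand-in:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Hypothetical daily unit-sales training series (synthetic stand-in).
    idx = pd.date_range("2011-01-29", periods=400, freq="D")
    train = pd.Series(100 + np.cumsum(np.random.default_rng(4).normal(0, 2, 400)), index=idx)

    random_walk = ARIMA(train, order=(0, 1, 0)).fit()   # random-walk ARIMA: differencing only
    future = random_walk.forecast(steps=28)             # forecast the next 28 days
    print(future.head())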

Figure 3.10 Actual vs predicted on validation data (ARIMA model, 810-846)

Figure 3.11 Actual vs predicted on validation data (ARIMA model, 991-999)

Figure 3.12 Actual vs predicted on validation data (LSTM model, 5 years)

Figure 3.13 Actual vs predicted on validation data (LSTM model, 7/2014-3/2016)

The models were assessed using the RMSE computation, a method for determining the difference between predicted values and the values recorded in the data set. Figure 3.14 compares the two models we created based on RMSE. Recall that the statistical metric RMSE (root-mean-square error) is frequently used to assess a model's accuracy by measuring the discrepancy between real dataset values and the values predicted by models. There is a small difference between the two RMSEs, with the LSTM model having the lower RMSE. Therefore, it is clear that the LSTM architecture performs better than ARIMA, and it was selected to forecast the next 28 days.

Figure 3.14 RMSE error vs model

Figure 3.15 Forecasting model for the future (LSTM model)

Conclusion and Future Work

In this thesis, the non-linearity, non-stationarity, and volatility of financial time series data make it difficult to define an ideal model for forecasting this sort of data. We succeeded in applying EDA (Exploratory Data Analysis) techniques to analyze influential variables and the ANOVA (Analysis of Variance) statistical method to retest those variables. Then, we built two forecasting models, ARIMA (Autoregressive Integrated Moving Average) and LSTM (Long Short-Term Memory), to find the best model. We then calculated the RMSE (root-mean-square error) metric on the two models to compare the two forecasting approaches for financial time series: RMSE values of approximately 0.779 (ARIMA) and 0.643 (LSTM) were observed. Forecasts for the following five years have been made using the built models. According to the results, the best model is the one built using LSTM. Walmart can expand its reach by investing in new regions or opening additional stores if its financial position is solid. One of the biggest benefits of its financial strength is its capacity for absorbing losses. Walmart can modify prices for slow-moving goods to increase consumer demand; while doing so lowers earnings, it helps the business get rid of excess inventory so it can import new goods. Strong financial potential also helps organizations become more competitive. For example, by lowering its prices and accepting losses, Walmart is able to somewhat halt the growth of many potential rivals. Additionally, this advantage enables Walmart to continue offering the finest return policy in the industry: Walmart is willing to accept losses in order to cultivate consumer loyalty, in contrast to many other brands, where returns are always carefully scrutinized to prevent financial loss.

Although LSTM outperforms the stochastic model in terms of building the optimal model, it is expensive in terms of runtime and computational power when the data are large and a significant number of repetitions are necessary. SARIMA can take its place for a dataset that is larger and less complicated but incorporates seasonality, because LSTM only improves accuracy by about 3%. It has been observed that the accuracy of LSTM is independent of the number of epochs employed, increasing or decreasing at random over time, so once a respectable level of accuracy is attained, it is advisable to stop at the bare minimum of epochs. The further a prediction lies past the last data point, the less accurate the forecast becomes. Future experiments can test a variety of fresh deep learning models. Depending on the data, it is also possible to construct combinations of stochastic and deep learning models to gain additional advantages. For sales predictions, we can also create whole web applications or mobile applications, which can aid in company decision-making as a whole. Overall, this effort considerably increased our knowledge of sequence models and time series data processing, and we look forward to investigating this work's potential future applications.

References

[1] Anastasia Borovykh, Sander Bohte, and Cornelis W. Oosterlee. "Conditional time series forecasting with convolutional neural networks." 2018. URL: https://arxiv.org/pdf/1703.04691.pdf

[2] Long Chen, Konstantinos Ampountolas, and Piyushimita (Vonu) Thakuriah. "Predicting Uber Demand in NYC with Wavenet." 2020. URL: http://eprints.gla.ac.uk/199034/

[3] Emmanuel Dave, Albert Leonardo, Marethia Jeanice, and Novita Hanafiah. "Forecasting Indonesia Exports using a Hybrid Model ARIMA-LSTM." URL: https://www.sciencedirect.com/science/article/pii/S1877050921000363

[4] Uppala Meena Sirisha, Manjula C. Belavagi, and Girija Attigeri. "Profit Prediction Using ARIMA, SARIMA and LSTM Models in Time Series Forecasting: A Comparison." URL: https://ieeexplore.ieee.org/abstract/document/9964190

[5] Maryem Rhanoui, Siham Yousfi, Mounia Mikram, and Hajar Merizak. "Forecasting financial budget time series: ARIMA random walk vs LSTM neural network." URL: https://www.sciencedirect.com/garudal493807.pdf

[6] M5 competition forecasting. URL: https://mofc.unic.ac.cy/m5-competition/

[7] M. Kulkarni, A. Jadha, and D. Dhingra. "Time series data analysis for stock market prediction." Proc. Int. Conf. Innov. Comput. Commun. (ICICC), pp. 1-6.

[8] M. A. Rahim and H. M. Hassan. "A deep learning based traffic crash severity prediction framework." Accident Anal. Prevention, vol. 154, May 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0001457521001214

[9] G. V. Attigeri, M. M. M. Pai, R. M. Pai, and A. Nayak. "Stock market prediction: A big data approach." Proc. IEEE Region 10 Conf. (TENCON), pp. 1-5, Nov. 2015.

