Probability and Statistics Major Assignment (Bài tập lớn Xác suất thống kê) - University of Technology (ĐH BK)


TABLE OF CONTENTS

Acknowledgement
List of figures
Section 1: Introduction
    Introduction
    Rationale
    Object and the range of study
    Aim of the study
    Research method
Section 2: Time series: Theoretical basis
    Time Series Decomposition
    ACF and PACF
    ARIMA MODEL
    Fit model
Section 3: Application
    Load libraries and data
    Import data
    Data cleaning
    Convert data to a time series object
    Time series decomposition
    Test stationarity
        ADF test
        Autocorrelation (ACF & PACF)
    Remove trend and seasonal effect
        ADF test
        ACF & PACF test
    Fit model
        ARIMA model
        Forecast
Section 4: Conclusion
Section 5: R Code
Reference

Acknowledgement

First of all, we would like to express our deep appreciation to Professor Nguyen Tien Dung for giving us the opportunity to work with RStudio, an important piece of software for statistical research. We are also grateful that he has conveyed an abundant amount of knowledge about Probability and Statistics to us. This project is a great chance for us to practise with RStudio; the software not only broadens our knowledge but also gives us ideas for future projects.

List of figures

Figure 1: Time series of the data
Figure 2: Time series decomposition
Figure 3: ACF diagram with trend and seasonality
Figure 4: PACF diagram with trend and seasonality
Figure 5: ACF diagram without trend and seasonality
Figure 6: PACF diagram without trend and seasonality
Figure 7: Linear regression model of the data
Figure 8: Diagrams of different analyses of residuals for model selection
Figure 9: Histogram and Q-Q plot of the residuals
Figure 10: Time series forecast

Section 1: Introduction

Introduction

The objective of this analysis and modelling is to review time series theory and experiment with R packages. We use a time series for this topic because we want to analyse data measured at successive moments in time and then use the trend in the data to predict its future behaviour. We follow an ARIMA modelling procedure on the Mauna Loa CO2 dataset:

- Perform exploratory data analysis
- Decompose the data
- Test the stationarity
- Fit a model using an automated algorithm
- Calculate forecasts

Rationale

We chose this topic because CO2 makes up 77% of greenhouse gas emissions and is the fourth most abundant gas in the Earth's atmosphere. In a normal concentration range it is a harmless gas with no colour or smell. In order to reduce pollution, analysis is necessary: by forecasting the future trend of CO2 and using information about the rate at which it is rising, we may determine techniques to minimise the amount of CO2 in the air.

Object and the range of study

We choose to analyse the atmospheric CO2 levels at Mauna Loa, Hawaii. At Mauna Loa Observatory, the atmospheric carbon dioxide concentration displays a yearly pattern that is remarkably consistent year after year. The amplitude of this seasonal signal can be expressed either as peak-to-peak concentration fluctuations or as a sum of harmonic terms. Moreover, the topic relates to our specialised skills in analysing chemicals in the environment around us.

Aim of the study

A thorough investigation of the calibration procedures and data analysis techniques used throughout this lengthy record fails to find any discrepancies significant enough to account for the increase. It is likely that at least some of the increase is a result of rising plant activity, because the northern hemisphere's yearly cycle of CO2 is assumed to be primarily caused by the metabolic activity of terrestrial plants.

Research method

We gathered information and data from the internet and from many reports about the amount of atmospheric CO2 at Mauna Loa. We then arranged the data in a table to make it easier to plot, and ran a small survey of many websites to collect further information. RPubs is the source that provided us with the exact data related to our topic; we then loaded this data into RStudio for plotting.

Section 2: Time series: Theoretical basis

A time series is, in statistics, signal processing, econometrics and financial mathematics, a sequence of data points measured at successive time intervals with a uniform frequency. The purpose of time-series data mining is to extract all meaningful knowledge from the shape of the data. Even though humans have a natural capacity for these tasks, they remain a complex problem for computers. In this report we give an overview of the techniques applied in time-series data mining, starting with the tasks that have captured most of the researchers' interest. Since in most cases a time-series task relies on the same components for its implementation, we organise the literature according to these common aspects, namely representation techniques, distance measures, and indexing methods.

Time Series Decomposition

We can decompose the time series into trend, seasonal and error components. The additive model is

    Y[t] = T[t] + S[t] + e[t]

where:

- Y[t] is the concentration of CO2 at time t,
- T[t] is the trend component at time t,
- S[t] is the seasonal component at time t,
- e[t] is the random error component at time t.

Classical decomposition of the time series is performed with the decompose function. In the decomposed plots we can again see the trend and seasonality inferred previously, but we can also observe the estimate of the random component, depicted as the "remainder".
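The classical decomposition above takes only a few lines of R. The following is a minimal sketch, assuming the monthly series is already stored in a ts object named co, as built in Section 3; the object name co_dec and the inspection calls are ours, not taken from the report.

    # Classical additive decomposition, as described above.
    co_dec <- decompose(co, type = "additive")
    plot(co_dec)           # panels: observed, trend, seasonal, random (the "remainder")
    head(co_dec$seasonal)  # inspect the estimated seasonal component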
ACF and PACF

A stationary time series is one whose mean, variance and covariance are not functions of time. In order to fit ARIMA models the time series must be stationary, so we test stationarity in two ways.

The first is the Augmented Dickey-Fuller (ADF) test, run with the adf.test function. We first set up the hypothesis test:

- The null hypothesis H0: the time series is non-stationary.
- The alternative hypothesis HA: the time series is stationary.

If the p-value is below 5%, we have strong evidence against the null hypothesis and reject it. In our case the test returns a p-value above 0.05, so we fail to reject the null hypothesis and treat the time series as non-stationary.

The second way to test for stationarity is to use autocorrelation. We use the autocorrelation function acf and the partial autocorrelation function pacf. These functions plot the correlation between a series and its lags (i.e. previous observations), with a 95% confidence interval drawn as dashed blue lines. If the autocorrelation crosses a dashed blue line, that specific lag is significantly correlated with the current series.

The ACF and PACF are used to determine the orders of AR, MA and ARMA models. The ACF and PACF plots can be obtained from the original data as well as from the residuals of a model. On the original data, these plots help detect any autoregressive or moving average terms that may be significant in the time series. When applied to the residuals, they detect any remaining autocorrelation in the model, which gives insight into whether additional AR or MA terms need to be included. Similarly, they can detect any seasonal behaviour that must be accounted for in the model.
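Both stationarity checks map onto short R calls. A minimal sketch, assuming the series is stored in the ts object co; adf.test comes from the tseries package, while acf and pacf are in base R. The lag.max value is our own choice.

    library(tseries)

    # Augmented Dickey-Fuller test: H0 = non-stationary, HA = stationary.
    adf.test(co)            # the report obtains p > 0.05 here, so H0 is not rejected

    # Correlograms: lags crossing the dashed 95% bands are significantly correlated.
    acf(co, lag.max = 48)   # slow decay with seasonal spikes (cf. Figures 3 and 4)
    pacf(co, lag.max = 48)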
ARIMA MODEL

Before testing the series for stationarity we need to address two issues. First, we need to remove the unequal variances; we do this by taking the log of the series. Second, we need to address the trend component; we do this by differencing the series, after which we test the resulting series again. Differencing is the standard technique for removing non-stationarity, and it is the Integration part of AR(I)MA.

ARIMA stands for Auto-Regressive Integrated Moving Average. It is specified by three ordered parameters (p, d, q):

- p is the order of the autoregressive model (the number of time lags). An autoregressive AR(p) component uses past values of the series Y in the regression equation; p specifies how many lags are included.
- d is the degree of differencing of the integrated I(d) component, i.e. the number of times the data have had past values subtracted. Differencing a series simply means subtracting its previous value from its current value, d times.
- q is the order of the moving average model. A moving average MA(q) component represents the model error as a linear combination of previous error terms e[t]; q determines how many error terms are included.

Seasonality can be incorporated into the ARIMA model directly. Because our time series exhibits seasonality, we actually use a model called SARIMA, that is, as the name suggests, a seasonal ARIMA. We write SARIMA as ARIMA(p,d,q)(P,D,Q)m, where:

- p is the number of autoregressive terms,
- d is the degree of differencing,
- q is the number of moving average terms,
- m is the number of periods in each season,
- (P, D, Q) are the (p, d, q) orders for the seasonal part of the time series.

We use the auto.arima function to fit the best model and its coefficients, with the default parameters and seasonality set to TRUE. ARIMA models are frequently used to predict demand, for example to estimate the future level of atmospheric CO2, so that decision-makers have solid figures to rely on when judging how to limit pollution. Based on historical data, ARIMA models can thus forecast how much CO2 our environment will contain in the future.

Fit model

Model fitting is a measure of how well a machine learning model generalises to data similar to the data on which it was trained. A well-fitted model produces more accurate outcomes; an overfitted model matches the training data too closely, while an underfitted model does not match it closely enough. Because machine learning algorithms can generalise a model to fresh data, we can use them to make predictions and classify new observations. When a model is given unknown inputs, a good fit is one whose output closely approximates the true outcome. Fitting a model means adjusting its parameters to increase its accuracy: a machine learning method is applied to data for which the target variable is known ("labelled" data) in order to produce a machine learning model. In our case we chose not to use a linear regression model, because it does not capture the seasonality and additive effects over time.
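The transformation, model selection and forecasting steps described in this section can be sketched as follows. This is only an illustration under our own assumptions: the series is stored in co, the model is fitted on the log scale (following the variance-stabilisation step above), and the 24-month forecast horizon is our choice; the report may have used different settings.

    library(tseries)
    library(forecast)

    # Stabilise the variance, then remove the trend and the 12-month seasonal effect.
    co_log <- log(co)
    co_adj <- diff(diff(co_log), lag = 12)
    adf.test(co_adj)        # should now reject H0, i.e. the adjusted series is stationary

    # Automatic (S)ARIMA order selection with seasonality enabled (the default).
    fit <- auto.arima(co_log, seasonal = TRUE)
    summary(fit)
    checkresiduals(fit)     # residual ACF, histogram and Ljung-Box test (cf. Figures 8-9)

    # Forecast the next 24 months and plot the result (cf. Figure 10).
    fc <- forecast(fit, h = 24)
    plot(fc)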
Section 3: Application

Load libraries and data

The first thing to do is to load the data set we will use. It contains observations of the concentration of carbon dioxide (CO2) in the atmosphere made at Mauna Loa from 1958 to 2020. This is an in-built data set in R, so it can be loaded via the data function.

    library(ggfortify)
    library(tseries)
    library(forecast)

Import data

    ts2
    ts1

Convert data to a time series object

    # The $ operator extracts the variable val from the data frame ts1.
    co <- ts(ts1$val, start = 1958, end = 2020, frequency = 12)
    summary(co)
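Figure 1 of the report shows the raw series. A minimal sketch of how such a plot could be drawn, assuming the ts object co built above; the axis labels and title are illustrative rather than taken from the report.

    # Plot the monthly CO2 series (cf. Figure 1); labels here are illustrative.
    plot(co, xlab = "Year", ylab = "CO2 concentration (ppm)",
         main = "Atmospheric CO2 at Mauna Loa")

Since ggfortify is loaded, autoplot(co) would give an equivalent ggplot2 version of the same figure.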
