
DOCUMENT INFORMATION

Basic information

Title: Analyzing and Forecasting Wikipedia Web Traffic
Author: Nguyen Le Khoa
Supervisor: Dr. Do Trong Hop
University: University of Information Technology
Major: Information Systems Engineering
Document type: Graduation Thesis
Year of publication: 2022
City: Ho Chi Minh City
Format
Number of pages: 88
File size: 39.57 MB

Structure

  • 1.4. Goal and Scope
  • 1.5. Thesis Structure
  • CHAPTER 2. RELATED WORK
    • 2.1. Web Traffic Prediction of Wikipedia Pages
    • 2.2. Web Traffic Time Series Forecasting using ARIMA and LSTM RNN
  • CHAPTER 3. THEORETICAL KNOWLEDGE OF TIME-SERIES ANALYSIS
    • 3.3.1. ARIMA Model
    • 3.3.2. Prophet Model
    • 3.3.3. LSTM Model
  • CHAPTER 4. OUR EXPERIMENTS
    • 4.1. Our Dataset
    • 4.2.1. Discovering basic dataset information
    • 4.3. Using Prediction Model to forecast the traffic flow on Wikipedia
      • 4.3.1. Check Stationary and Non-stationary
      • 4.3.2. Auto Correlation and Partial Correlation
      • 4.3.3. Building and Testing Model
        • 4.3.3.2. ARIMA Model
        • 4.3.3.3. Prophet Model
        • 4.3.3.4. LSTM Model
  • CHAPTER 5. CONCLUSION AND FUTURE WORKS
    • 5.1. Limitations

Content

Web Traffic Time Series Forecasting using ARIMA and LSTM RNN

CATALOG OF ACRONYMS

1. AR: Auto Regressive
2. ACF: Auto Correlation Function
3. ADF: Augmented Dickey-Fuller

Thesis Structure

The thesis is divided and organized into five chapters; the main information and content of each part are as follows:

- Chapter 1: Introduction: An introduction to the thesis topic, including the problem statement, proposed solutions, the challenges encountered along the way, and finally the goals and scope of the research.

- Chapter 2: Related Works: This thesis follows and analyzes two research papers from previous authors.

- Chapter 3: Theoretical Knowledge: This chapter concentrates on the essential theoretical knowledge gained through self-study and research from online materials, namely the definition of a time series, some important terms in this field, and the prediction models.

- Chapter 4: Our Experiments: This chapter explores the dataset collected by Google and then builds the ARIMA, Prophet, and LSTM models to predict the web traffic flow.

- Chapter 5: Conclusion & Future Works: Finally, this chapter summarizes the results of the analysis and prediction process and the machine- and deep-learning knowledge gained during the thesis, both of which provide a strong platform for future research. To extend and improve this thesis, more effort and time need to be spent on further research, thereby developing the current and other prediction models.

2.1 Web Traffic Prediction of Wikipedia Pages

I searched through a great deal of material related to the topic I am diving into in order to find a way to address the problem, and settled on two papers. The first is directly associated with my thesis, as it uses the same forecasting models, namely ARIMA and LSTM RNN. The second mainly uses an RNN seq2seq model and then investigates the use of the symmetric mean absolute percentage error (SMAPE) for measuring the overall performance and accuracy of the developed model. Finally, I compare the outcome of our developed model to existing ones to determine the effectiveness of the proposed method in predicting future traffic of Wikipedia articles.
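Since SMAPE is the headline metric in the second paper, it helps to pin its formula down in code. This is a minimal sketch following the common definition; the function name and the small denominator guard are my own, not taken from the thesis:

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent (0 = perfect fit).
    The denominator guard avoids division by zero when both values are 0."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denom = (np.abs(actual) + np.abs(forecast)) / 2.0
    return 100.0 * np.mean(np.abs(forecast - actual) / np.maximum(denom, 1e-9))

# Forecasts within roughly 5% of the actual daily views score a SMAPE near 6.
print(round(smape([100, 200, 300], [110, 190, 310]), 2))  # 5.98
```

Because the denominator averages the actual and forecast magnitudes, SMAPE is bounded (at most 200) and penalizes over- and under-forecasting symmetrically.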

Numerous techniques for predicting web traffic have been proposed in the literature. Based on the examined models, they may be broadly divided into two groups: linear prediction and nonlinear prediction. The Holt-Winters algorithm, the AR model, and the MA model are examples of linear forecast models; for nonlinear prediction, recurrent neural networks are frequently used. The discrete wavelet transform (DWT), which increases forecast accuracy, separates the data into linear and non-linear components [2]. By leveraging GPU computing to train on the dataset, ES-RNN performs better. India was chosen as the sample country for the analysis of the dataset, which was additionally split into training and testing sets. For the time series, I plotted the number of hits per day for the article "India" together with the actual data and the projections; the time interval is represented on the x-axis, while the page visits are represented on the y-axis in powers of 10. As a result, I can conclude the following: a Long Short-Term Memory Recurrent Neural Network combined with an Autoregressive Integrated Moving Average model can predict web traffic time series more correctly and efficiently. It is possible to predict how many people will visit the website in the future, and as more user data is fed into the system, the predictions will continue to improve. This solution may be applied to any website to enhance business analysis and web traffic load management [3]. The system is more efficient thanks to the LSTM RNN, which accurately captures seasonal cycles and long-term trends. The model might be further improved by including information on holidays, the day of the week, language, and region to better capture the highs and lows.

2.2 Web Traffic Time Series Forecasting using ARIMA and LSTM RNN

Compared to the first paper, I examined the winning model submitted to the Kaggle competition, which used an RNN seq2seq model. This prediction model is based on the following data:

— The total number of visits;

— Features taken from page URLs;

— The day of the week, which analyzes the weekly seasonality data;

— Year-to-year autocorrelation (quarterly and yearly);

Using an encoder/decoder architecture, I reconstructed the previously successful model, utilizing the complete dataset as training data. CUDNN GRU [7] is the encoder of choice, since it completes tasks more quickly than conventional implementations; TensorFlow's GRUBlockCell is the decoder [7]. From the subsequent step until the end of the sequence, the decoding results are used as inputs for a given batch size. According to the observed data, a better model will have lower measurement values. Although they collected all of the forecasts from September 13 to November 13, 2017, these cannot be compared to the actual web traffic figures. They used 740 days as a training set to forecast traffic levels from July 11th, 2017, to September 11th, 2017, in order to visualize the disparities between the real data and the expected data. For four separate web pages, they plotted the number of hits versus days for the full time series, along with the actual values and projections during the testing period. Predictions made with the new model are generally more accurate. They ran three models concurrently on two seeds, and created the resulting SMAPE graph using TensorBoard for each of the three old and new models to check for variances in model performance at each stage.

As they use sequential information and run the model on each element of the sequence, recurrent neural networks (RNNs) perform best when given minimal inputs. The RNN was able to make stable predictions of the web traffic based on past calculations while using a simple median as a feature, which is why adding the rolling median and a Fibonacci-windowed median as features resulted in only a minor improvement.

3 THEORETICAL KNOWLEDGE OF TIME-SERIES ANALYSIS

3.1 The definition of time series analysis

In mathematics, a time series is a group of data points that have been indexed chronologically. Most frequently, a time series is a sequence captured at equally spaced moments in time; as a result, it is a collection of discrete-time data.

A "time series analysis" is a particular method of examining a set of data points gathered over a period. Instead of capturing the data points intermittently or randomly, time series analysts record the data points at regular intervals over a predetermined length of time.

Figure 3.1 Basic Time-Series Prediction Model

- A Time-Series represents a series of time-based orders: years, months, weeks, days, hours, minutes, and seconds;

- A time series consists of observations taken at successive, discrete intervals of time;

- A time series is a running chart;

- The time variable/feature is the independent variable and supports the target variable to predict the results;

- Time Series Analysis (TSA) is used in different fields for time-based predictions, like weather forecasting, finance, signal processing, and engineering domains such as control systems and communications systems;

- Since TSA involves producing a set of information in a particular sequence, it is distinct from spatial and other analyses;

- Using AR, MA, ARMA, and ARIMA models, I can predict the future.

Figure 3.2 Time Series Analysis Process

3.2 Important terms to understand time series analysis

A time series is a collection of data points that were collected at various times; in essence, these are repeated measurements taken from the same data source over time. Using these chronologically obtained readings, I can track trends and changes over time. Both univariate and multivariate time-series models are available. When the dependent variable is a single time series, such as the temperature of a room measured by a single sensor, univariate time series models are used. On the other hand, where there are numerous dependent variables and the output depends on multiple series, a multivariate time series model can be utilized. Modelling GDP, inflation, and unemployment together, as these variables are related to one another, could serve as an example of a multivariate time-series model.

Stationary and Non-Stationary Time Series: Stationarity is a property of a time series. A stationary series is one whose values are not a function of time; that is, the statistical properties of the series, like the mean, variance, and autocorrelation, are constant over time. Autocorrelation of the series is nothing but the correlation of the series with its previous values (more on this coming up). A stationary time series is devoid of seasonal effects as well;

Trend: The trend shows the general direction of the time series data over a long period of time. A trend can be increasing (upward), decreasing (downward), or horizontal (stationary);

Seasonality: The seasonality component exhibits a pattern that repeats with respect to timing, direction, and magnitude. Some examples include an increase in water consumption in summer due to hot weather conditions;

Cyclical Component: These are the trends with no set repetition over a particular period. A cycle refers to the period of ups and downs, booms, and slumps of a time series, mostly observed in business cycles. These cycles do not exhibit a seasonal variation but generally occur over a period of 3 to 12 years depending on the nature of the time series;

Irregular Variation: These are erratic, unsystematic fluctuations caused by unforeseen and unpredictable events, and they are not explained by the trend, seasonal, or cyclical components of the series;

ETS Decomposition: ETS Decomposition is used to separate the different components of a time series. The term ETS stands for Error, Trend, and Seasonality;

Dependence: It refers to the association of two observations of the same variable at prior time periods;

Differencing: Differencing is used to make the series stationary and to control the autocorrelations. There may be some cases in time series analyses where differencing is not required, and an over-differenced series can produce wrong estimates;
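What differencing does can be shown in one line of pandas; the numbers below are illustrative, not from the dataset:

```python
import pandas as pd

# Daily views with a steady upward trend (illustrative numbers).
views = pd.Series([100, 110, 120, 130, 140, 150])

# First-order differencing replaces each value with its day-over-day change,
# removing the linear trend and leaving a constant (stationary) series.
diffed = views.diff().dropna()
print(diffed.tolist())  # [10.0, 10.0, 10.0, 10.0, 10.0]
```

Differencing this series again (`diffed.diff()`) would over-difference it, which is exactly the pitfall the paragraph above warns about.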

Specification: It may involve the testing of the linear or non-linear relationships of dependent variables by using time series models such as ARIMA models;

3.3.2 Prophet Model

Facebook is open sourcing Prophet, a forecasting tool available in Python and R. Forecasting is a data science task that is central to many activities within an organization.

For instance, large organizations like Facebook must engage in capacity planning to efficiently allocate scarce resources, and in goal setting in order to measure performance relative to a baseline. Producing high-quality forecasts is not an easy problem for either machines or most analysts. I have observed two main themes in the practice of creating a variety of business forecasts:

— Completely automatic forecasting techniques can be brittle and they are often too inflexible to incorporate useful assumptions or heuristics;

- Analysts who can produce high quality forecasts are quite rare because forecasting is a specialized data science skill requiring substantial experience.

The result of these themes is that the demand for high-quality forecasts often far outstrips the pace at which analysts can produce them. This observation is the motivation for the work building Prophet: I want to make it easier for experts and non-experts to make high-quality forecasts that keep up with demand.

The typical considerations that "scale" implies, computation and storage, are not as much of a concern for forecasting. I have found the computational and infrastructure problems of forecasting a large number of time series to be relatively straightforward: typically these fitting procedures parallelize quite easily, and forecasts are not difficult to store in relational databases such as MySQL or data warehouses such as Hive.

The problems of scale I have observed in practice involve the complexity introduced by the variety of forecasting problems, and the need to build trust in a large number of forecasts once they have been produced. Prophet has been a key piece in improving Facebook's ability to create many trustworthy forecasts used for decision-making and even in product features.

Not all forecasting problems can be solved by the same procedure. Prophet is optimized for the business forecast tasks encountered at Facebook, which typically have any of the following characteristics:

- Hourly, daily, or weekly observations with at least a few months (preferably a year) of history;
- Strong multiple "human-scale" seasonalities: day of week and time of year;
- Important holidays that occur at irregular intervals and are known in advance (e.g. the Super Bowl);
- A reasonable number of missing observations or large outliers;
- Historical trend changes, for instance due to product launches or logging changes;
- Trends that are non-linear growth curves, where a trend hits a natural limit or saturates.

At its core, the Prophet procedure is an additive regression model with four main components:

- A piecewise linear or logistic growth curve trend; Prophet automatically detects changes in trend by selecting changepoints from the data;
- A yearly seasonal component modelled using Fourier series;
- A weekly seasonal component using dummy variables;
- A user-provided list of important holidays.

As an example, here is a characteristic forecast: log-scale page views of Peyton Manning's Wikipedia page, downloaded using the Wikipedia trend package. Since Peyton Manning is an American football player, you can see that yearly seasonality plays an important role, while weekly periodicity is also clearly present. Finally, you can see that certain events (like playoff games he appears in) may also be modelled [17].

3.3.3 LSTM Model

LSTM, standing for Long Short-Term Memory, is an artificial neural network used in the fields of artificial intelligence and deep learning [8]. Unlike standard feedforward neural networks, LSTM has feedback connections. Such a recurrent neural network (RNN) can process not only single data points (such as images), but also entire sequences of data (such as speech or video). For example, LSTM is applicable to tasks such as unsegmented, connected handwriting recognition, speech recognition, machine translation, robot control, video games, and healthcare. LSTM has become the most cited neural network of the 20th century.

Humans do not start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You do not throw everything away and start thinking from scratch again; your thoughts have persistence.

Traditional neural networks cannot do this, and it seems like a major shortcoming. For example, imagine I want to classify what kind of event is happening at every point in a movie. It is unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.

RNNs are networks with loops in them, allowing information to persist. The name LSTM refers to the analogy that a standard RNN has both "long-term memory" and "short-term memory". The connection weights and biases in the network change once per episode of training, analogous to how physiological changes in synaptic strengths store long-term memories; the activation patterns in the network change once per time-step, analogous to how the moment-to-moment change in electric firing patterns in the brain stores short-term memories. The LSTM architecture aims to provide a short-term memory for the RNN that can last thousands of timesteps, hence the name "long short-term memory".

A common LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell.

LSTM networks are well-suited to classifying, processing, and making predictions based on time series data, since there can be lags of unknown duration between important events in a time series. LSTMs were developed to deal with the vanishing gradient problem that can be encountered when training traditional RNNs. Relative insensitivity to gap length is an advantage of LSTM over RNNs, hidden Markov models, and other sequence learning methods in numerous applications.

Figure 3.6 LSTM Model

The Core Idea Behind LSTMs:

- The key to LSTMs is the cell state, the horizontal line running through the top of the diagram;

- The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It is very easy for information to just flow along it unchanged.

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation.

Figure 3.8 LSTM Gate

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means "let nothing through," while a value of one means "let everything through!" An LSTM has three of these gates to protect and control the cell state.
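The gate equations described above fit in a few lines of NumPy. This is a from-scratch sketch of a single LSTM step with random, untrained weights, purely to show how the forget, input, and output gates combine; a real model would use a library such as Keras and learned parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the stacked parameters for the
    forget (f), input (i), candidate (g) and output (o) transforms."""
    z = W @ x + U @ h_prev + b
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # gates squashed into (0, 1)
    g = np.tanh(g)                                # candidate cell update
    c = f * c_prev + i * g                        # forget old state, add new info
    h = o * np.tanh(c)                            # gated output (short-term memory)
    return h, c

# Toy dimensions: 1 input feature, hidden size 4; weights random, illustration only.
rng = np.random.default_rng(0)
n_in, n_hidden = 1, 4
W = rng.normal(size=(4 * n_hidden, n_in))
U = rng.normal(size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)

h = np.zeros(n_hidden)
c = np.zeros(n_hidden)
for x_t in [0.5, 0.1, -0.3]:  # a short input sequence
    h, c = lstm_step(np.array([x_t]), h, c, W, U, b)
print(h.shape)  # (4,)
```

Because the output gate and the `tanh` both bound their outputs, every entry of `h` stays in (-1, 1), which is part of what keeps the gradients of an LSTM manageable over long sequences.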

4.1 Our Dataset

Google's Web Traffic Time Series Forecasting dataset (supplied by Kaggle) [4], which contains over 145,000 Wikipedia articles, is the main dataset utilized for this project. The dataset has a field that represents a time series, i.e. multiple points ordered chronologically: each time series indicates the number of daily visits to a separate Wikipedia page from July 1, 2015, through December 31, 2016. I must account for some ambiguity in the overall forecasts of the prediction model, because the dataset does not distinguish between values of zero and missing values in the traffic data. The traffic data itself is contained in train_*.csv.

Figure 4.1 Wikipedia Web Traffic Dataset in wrong time-series format

Figure 4.2 Wikipedia Web Traffic Dataset in proper time-series format

This is a CSV file where each row corresponds to a particular article and each column corresponds to a particular date. Some entries are missing data. The page names contain the Wikipedia project (e.g. en.wikipedia.org), the type of access (e.g. desktop), and the type of agent (e.g. spider). In other words, each article name has the following format: 'name_project_access_agent', for example, '2PM_zh.wikipedia.org_all-access_spider'.

       topic                            lang  access      agent
1      2PM                              zh    all-access  spider
3      4minute                          zh    all-access  spider
4      52_Hz_I_Love_You                 zh    all-access  spider
4995   Grotte_de_Lascaux                fr    desktop     all-agents
4996   Groupe_Bilderberg                fr    desktop     all-agents
4997   Groupe_Caisse_d%27%C3%A9pargne   fr    desktop     all-agents
4998   Groupe_sanguin                   fr    desktop     all-agents
4999   Grzegorz_Krychowiak              fr    desktop     all-agents
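Splitting the page identifiers into these four fields is easiest from the right-hand side, since the article names themselves may contain underscores. A small sketch (the function name is mine, not from the thesis):

```python
def parse_page(page):
    """Split 'name_project_access_agent' from the right, because the article
    name itself may contain underscores (e.g. '52_Hz_I_Love_You')."""
    name, project, access, agent = page.rsplit("_", 3)
    return {"name": name, "project": project, "access": access, "agent": agent}

print(parse_page("2PM_zh.wikipedia.org_all-access_spider"))
# {'name': '2PM', 'project': 'zh.wikipedia.org', 'access': 'all-access', 'agent': 'spider'}
print(parse_page("52_Hz_I_Love_You_zh.wikipedia.org_all-access_spider")["name"])
# 52_Hz_I_Love_You
```

Splitting from the left with `split("_")` would break multi-word article names, which is why `rsplit` with a limit of 3 is used.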

4.2.1 Discovering basic dataset information

To start, here are some basic visualizations of the dataset's training data that can provide some insight into how I might expand the analysis beyond simple regression models; they also show that there are numerous possible challenges in producing accurate predictions. I will downcast everything to an integer to conserve memory. I do this after reading the file, because Pandas does not automatically allow columns containing NaN values to be set to integer types while reading. As a result, the memory footprint should shrink from roughly 600 MB to 300 MB. Since views are integers by nature, no information is lost; however, I might prefer floating-point predictions.
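The downcasting step can be sketched on a toy two-day frame; `pd.to_numeric(..., downcast="integer")` picks the smallest integer type that holds the values. Filling NaN with 0 is acceptable here only because, as noted above, the dataset already conflates zero and missing:

```python
import numpy as np
import pandas as pd

# A tiny stand-in for the Kaggle frame: view columns arrive as float64
# because missing days are read in as NaN.
df = pd.DataFrame({
    "Page": ["2NE1_zh.wikipedia.org_all-access_spider"],
    "2015-07-01": [18.0],
    "2015-07-02": [np.nan],
})

view_cols = df.columns[1:]
before = df[view_cols].memory_usage(deep=True).sum()

# Fill the gaps, then downcast 8-byte floats to the smallest integer type
# that fits; views are integral by nature, so nothing is lost.
df[view_cols] = df[view_cols].fillna(0).apply(pd.to_numeric, downcast="integer")
after = df[view_cols].memory_usage(deep=True).sum()

print(df[view_cols].dtypes.unique(), before, after)
```

On the real 145,000-row frame the same two lines are what turn the 8-byte float columns into 1- or 2-byte integers, which is where the memory saving comes from.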

Then, one thing that might be interesting to examine is how the many languages used on Wikipedia might affect the dataset. I will search for the language code in the Wikipedia URL using a straightforward regular expression; a lot of URLs that are not on Wikipedia will not pass the regex search. (A regular expression is a form of advanced searching that looks for specific patterns, as opposed to certain terms and phrases. With a regex I can use pattern matching to search for particular strings of characters rather than constructing multiple, literal search queries.) Since I do not know what language the Wikimedia sites (wikimedia.org) and the Commons site (commons.wikimedia.org) are in, I will give them the code "other"; a large portion of these will be non-linguistic items like photos. From the dataset above, I can conclude some basic information as follows:
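The language-extraction regex might look like this; the exact pattern used in the thesis is not shown, so this is a plausible reconstruction:

```python
import re

def get_language(page):
    """Pull the two-letter language code out of a Wikipedia page name; anything
    that is not a *.wikipedia.org URL (e.g. commons.wikimedia.org) gets 'other'."""
    match = re.search(r"([a-z]{2})\.wikipedia\.org", page)
    return match.group(1) if match else "other"

pages = [
    "2NE1_zh.wikipedia.org_all-access_spider",
    "Grotte_de_Lascaux_fr.wikipedia.org_desktop_all-agents",
    "Main_Page_commons.wikimedia.org_all-access_all-agents",
]
print([get_language(p) for p in pages])  # ['zh', 'fr', 'other']
```

Note that the pattern anchors on the literal `.wikipedia.org`, so underscore-separated fragments inside the article name (like `de` in `Grotte_de_Lascaux`) are not mistaken for language codes.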

Columns: 551 entries, Page to 2016-12-31; dtypes: float64 (550), object (1)

Now, I create a table clearly showing the project name, the number of columns, the number of columns with nulls, the percentage of nulls, the total views, and the average views, as below:

Figure 4.4 Web Traffic Dataset divided by project, number of columns, number of columns with nulls, % of nulls, total views, average views

4.2.2 The distribution of Language, Access, and Type of agent

Here is the overall distribution of Language, Access, and Type of agent. Looking at the charts below, I can see that English is the most popular language used on Wikipedia, and that the two access types, desktop and mobile, are roughly equal.

Project (based on Language)          Number of columns
0  de.wikipedia.org (de — German)    18547
1  en.wikipedia.org (en — English)   24108
2  es.wikipedia.org (es — Spanish)   14069
4  ja.wikipedia.org (ja — Japanese)  20431
5  ru.wikipedia.org (ru — Russian)   15022
6  zh.wikipedia.org (zh — Chinese)   17229

Figure 4.5 The distribution of Language

Access type Number of columns

Figure 4.6 The distribution of Access

Figure 4.7 The distribution of Type of agent

4.2.3 How do languages affect the views per page/project?

How the various Wikipedia languages may affect the dataset is one thing that would be interesting to investigate. To find the language code in the Wikipedia URL, I employ a straightforward regular expression; numerous URLs that are not on Wikipedia will also be rejected by the regex search. Since I do not know the language of the Wikimedia pages, I will use the code "other" for them; many of them will be non-linguistic objects, such as photos.

In addition to the media pages, there are 7 languages: English, Japanese, German, French, Chinese, Russian, and Spanish. As a result, it will be challenging to analyze the URLs, because there are four different writing systems to consider (Latin, Cyrillic, Chinese, and Japanese). Here, I will make data frames for the various entry kinds and then figure out the total number of views. I should point out that the sum probably double counts some of the views, because the data originates from numerous sources. So how does the total number of views change over time? I will plot all the different sets on the same plot.
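Summing the views per language and plotting them on one axis is a one-line groupby; the frame below is a tiny stand-in, with a `lang` column assumed to have been added by the regex step earlier:

```python
import pandas as pd

# A tiny stand-in frame: one row per page, plus the extracted 'lang' column;
# all column names and values here are illustrative.
df = pd.DataFrame({
    "lang": ["en", "en", "es", "zh"],
    "2015-07-01": [1000, 500, 200, 50],
    "2015-07-02": [1100, 450, 210, 60],
})

# Sum the daily views over all pages of each language...
totals = df.groupby("lang").sum(numeric_only=True)
print(totals)

# ...then one line per language on a shared axis (requires matplotlib):
# totals.T.plot(title="Total daily views per language")
```

Transposing before plotting puts the dates on the x-axis and gives one line per language, which is the "all sets on the same plot" view described above.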

Since Wikipedia is a US-based website [9], it makes sense that English exhibits a substantially higher number of visitors per page. Contrary to what I had anticipated, there is a lot more structure here. There are numerous additional spikes in the English data later in 2016, and both the English and Russian plots indicate very high peaks around day 400 (about August 2016). My best guess is that this is a result of the US election as well as the August Summer Olympics [16].

The English data also has an odd trend around day 200, and the Spanish data is also fascinating: there is a distinct cyclical pattern, with a quick period of about one week and what appears to be a substantial dip every six months or so. Since it looks like there is some periodic structure here, I will plot each of these separately so that the scale is more visible.

Figure 4.8 Pages/Projects in Different Languages

To be more specific, I will show the popular pages per language:

Figure 4.9 Average Views Per Each Page/Project

Figure 4.10 Popular Pages in en.wikipedia.org (Main_Page and Special:Search across access types)

Figure 4.11 Popular Pages in es.wikipedia.org (Wikipedia:Portada and the search page across access types)

Figure 4.12 Popular Pages in ru.wikipedia.org (the main page Заглавная_страница and the search page across access types)

Figure 4.13 Popular Pages in de.wikipedia.org (Wikipedia:Hauptseite and Spezial:Suche across access types)

Figure 4.14 Popular Pages in ja.wikipedia.org

Figure 4.15 Popular Pages in fr.wikipedia.org (Wikipédia:Accueil_principal and Spécial:Recherche across access types)

Figure 4.16 Popular Pages in zh.wikipedia.org

Figure 4.17 Popular Pages in commons.wikimedia.org (Special:CreateAccount, Special:Search and Main_Page)

Figure 4.18 Popular Pages in www.mediawiki.org

I will now plot the data for a few distinct entries. I chose a few entries to review, but they do not necessarily stand out.

First, the charts for some English pages are shown below:

"Awaken, My Love!"_en.wikipedia.org_desktop_all-agents

Figure 4.19 The number of views in English Pages — Awaken My Love

10_Cloverfield_Lane_en.wikipedia.org_desktop_all-agents

Figure 4.20 The number of views in English Pages — Cloverfield Lane

Fiji_Water_en.wikipedia.org_desktop_all-agents

Figure 4.21 The number of views in English Pages — Fiji Water

Internet_of_things_en.wikipedia.org_desktop_all-agents

Figure 4.22 The number of views in English Pages — Internet of things

2016_North_Korean_nuclear_test_en.wikipedia.org_desktop_all-agents

Figure 4.23 The number of views in English Pages — 2016 North Korean Nuclear Test

A_Series_of_Unfortunate_Events_(TV_series)_en.wikipedia.org_desktop_all-agents

Figure 4.24 The number of views in English Pages — TV Series Unfortunate Events

Hunky_Dory_en.wikipedia.org_desktop_all-agents

Figure 4.25 The number of views in English Pages — Hunky Dory

I can observe that the data is not smooth for many pages: there are unexpectedly high spikes, significant changes in the average number of views, and so on. Additionally, it is obvious how current events affect Wikipedia views. A Wikipedia page about the 2016 North Korean nuclear test was swiftly created and acquired many views in a short period of time; within one to two weeks, the number of views generally decreased.

Around the time of David Bowie's passing at the beginning of 2016, Hunky Dory attracted a sizable audience. A modest number of people visited the page about the shooting competition at the 2016 Olympics, and then suddenly a lot of people did around the Olympics.

There are also some oddities, like two huge spikes in the data for Fiji Water, and the sudden long-term increases in traffic to "Internet of Things" and "Credit score". Maybe there were news stories about Fiji Water on those days. For the others, maybe there was a change in search engine behavior, or some new links appeared in very visible locations.
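The median-filter idea mentioned below for removing very short spikes can be sketched as follows. The series here is made up, standing in for one page's daily views; with the real data it would be a row from the training file.

```python
import pandas as pd

# Synthetic daily views with two one-day spikes (5000 and 4800) that a
# centered rolling median removes while leaving the baseline intact.
views = pd.Series([120, 130, 125, 5000, 128, 122, 4800, 119, 124, 131])

smoothed = views.rolling(window=3, center=True, min_periods=1).median()
print(smoothed.tolist())  # the 5000 and 4800 spikes are gone
```

A window of 3 kills spikes lasting a single day; a window of 5 would also absorb two-day spikes, at the cost of slightly blunting genuine short events.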

Now I will look at some Spanish entries as follows:

Figure 4.26 The number of views in Spanish Pages — Anorexia

Figure 4.27 The number of views in Spanish Pages — Charles Perrault

Figure 4.28 The number of views in Spanish Pages — Fiesta

Figure 4.29 The number of views in Spanish Pages — La usurpadora

Compared to the English data, this exhibits significantly more dramatic short-term increases. It may be a hint that something is wrong with the data if some of these only last one or two days before returning to the mean. Something like a median filter can be used to get rid of very short spikes, which I almost surely won't be able to forecast.

But I notice something quite interesting here. I can see that specific pages contain a very strong periodic structure. It turns out that all the graphs exhibiting the strongest periodic structure share a connection to health-related themes.

If the weekly structure is connected to individuals visiting doctors and then using Wikipedia, that might make sense. The longer (6-month) structure is more difficult to explain, especially without knowing the demographics of the browser users.
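One simple way to check for the weekly structure described above is the autocorrelation at a 7-day lag. This sketch uses a synthetic series with a built-in 7-day cycle rather than the real health-page data; with a real page's views, a clearly elevated lag-7 (and lag-14, lag-21, ...) autocorrelation is the signature of a weekly pattern.

```python
import numpy as np
import pandas as pd

# Synthetic daily views: a 7-day cycle plus noise.
rng = np.random.default_rng(42)
days = np.arange(365)
views = 200 + 50 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 5, 365)
s = pd.Series(views)

# Autocorrelation at a weekly lag is high if a 7-day cycle is present;
# at a lag that is not a multiple of 7 it drops (here it goes negative).
print(round(s.autocorr(lag=7), 2))
print(round(s.autocorr(lag=3), 2))
```

The same check at a lag of roughly 180 days would probe the 6-month structure, though a year and a half of data gives only about three repetitions to estimate it from.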

Figure 4.30 The number of views in French Pages — Leicester

Figure 4.31 The number of views in French Pages — Roselyne

Figure 4.32 The number of views in French Pages — Noel

There is more of the same in the French charts. Once more, the number of views on Wikipedia greatly depends on what is trending in the media. Leicester FC won the Premier League and attracted lots of traffic during the competition. Their page saw a dramatic increase in visitors following the Olympics. The Christmas (Noel) page shows an intriguing progression during Advent.

I briefly touched on some of the potential issues with the aggregated data, so now I will take a closer look at the most visited pages, which are often the main landing pages for the languages represented in this dataset.
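Per-language rankings like the listings below can be produced by parsing the language out of each page name and summing the daily columns. This is a sketch on a few synthetic rows (the real file has the same 'Page'-plus-date-columns layout); the regex and the 'na' fallback for non-wikipedia projects such as commons.wikimedia.org are my own assumptions about how the names were parsed.

```python
import re
import pandas as pd

# Hypothetical stand-in rows in the Kaggle-style layout.
df = pd.DataFrame({
    "Page": [
        "Main_Page_en.wikipedia.org_all-access_all-agents",
        "Special:Search_en.wikipedia.org_all-access_all-agents",
        "Wikipedia:Portada_es.wikipedia.org_all-access_all-agents",
    ],
    "2016-01-01": [1000, 400, 700],
    "2016-01-02": [1100, 420, 690],
})

def get_language(page):
    """Extract the two-letter language code; 'na' for non-wikipedia projects."""
    m = re.search(r"_([a-z]{2})\.wikipedia\.org_", page)
    return m.group(1) if m else "na"

df["lang"] = df["Page"].apply(get_language)
df["total"] = df.drop(columns=["Page", "lang"]).sum(axis=1)

# Top English pages by total views over the whole period.
top_en = (df[df["lang"] == "en"]
          .sort_values("total", ascending=False)[["Page", "total"]])
print(top_en)
```

Repeating the last step per language code reproduces one listing per language, with the main pages and Special:Search pages dominating, as the figures show.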

Figure 4.33 English Language represented in dataset

Figure 4.34 Japanese Language represented in dataset

Figure 4.35 German Language represented in dataset

Figure 4.36 Unidentified Language represented in dataset

Figure 4.37 French Language represented in dataset

Figure 4.38 Chinese Language represented in dataset

Figure 4.39 Russian Language represented in dataset

Figure 4.40 Spanish Language represented in dataset

Figure 4.41 Chart representing a specific English page, all access and all agents

Figure 4.42 Chart representing a specific Japanese page, all access and all agents

Figure 4.43 Chart representing a specific German page, all access and all agents

Figure 4.44 Chart representing a specific Unidentified page, all access and all agents

Figure 4.45 Chart representing a specific French page, all access and all agents

Figure 4.46 Chart representing a specific Chinese page, all access and all agents

Figure 4.47 Chart representing a specific Russian page, all access and all agents

Figure 4.48 Chart representing a specific Spanish page, all access and all agents

I can observe that most things are comparable when I compare these to the aggregated data. The fact that the Olympics would have such a significant impact on a website like Wikipedia honestly surprises me quite a bit. The most varied statistics, in my opinion, are those in Japanese, Spanish, and the media pages. This is expected for media pages, since most users will access them through links from other websites rather than the home page or search bar. It is possible that the dataset is not representative of all traffic to Wikipedia, because several of the languages exhibit significant discrepancies between the main page and the aggregated data.

4.3 Using Prediction Models to forecast the traffic flow on Wikipedia

4.3.1 Check Stationary and Non-stationary

On top of that, this thesis checks whether the time series is stationary or non-stationary; for this purpose, my group used the Augmented Dickey-Fuller (ADF) test.

It is a hypothesis test in which the null hypothesis is that the data is non-stationary:

- p-value > 0.05: Fail to reject the null hypothesis (H0), the data has a unit root and is non-stationary.

- p-value &lt;= 0.05: Reject the null hypothesis (H0), the data does not have a unit root and is stationary.
