
Master's thesis in Computer Science: Applying Machine Learning Techniques in Extracting Information From the Log File


DOCUMENT INFORMATION

Basic information

Title: Applying Machine Learning Techniques in Extracting Information From the Log File
Author: Nguyen Minh Tri
Supervisor: Assoc. Prof. Dr. Nam Thoai
University: Ho Chi Minh City University of Technology
Major: Computer Science
Document type: Master Thesis
Year: 2019
City: Ho Chi Minh City
Pages: 87
Size: 2.69 MB

Structure

  • CHAPTER 1: INTRODUCTION
    • 1.1 Overview
    • 1.2 Contribution
    • 1.3 Research scope
    • 1.4 Thesis Outline
  • CHAPTER 2: BACKGROUND
    • 2.1 Weblog Mining
      • 2.1.1 Data pre-processing
      • 2.1.2 Data Mining
    • 2.2 Time Series Prediction
      • 2.2.1 Linear regression
      • 2.2.2 Non-linear regression
    • 2.3 Language modeling
      • 2.3.1 Sequence-to-sequence model
      • 2.3.2 Attention mechanism
      • 2.3.3 The Transformer
  • CHAPTER 3: RELATED WORK
    • 3.1 Weblog Mining
      • 3.1.1 Web content mining
      • 3.1.2 Web structure mining
      • 3.1.3 Web usage mining
    • 3.2 Time Series Prediction
    • 3.3 Predicting content popularity
  • CHAPTER 4: LOGFILE PROCESSING & ANALYSIS
    • 4.1 Weblog of HCMUT Website
      • 4.1.1 Data Pre-processing
      • 4.1.2 Pattern Analysis
    • 4.2 MovieLens Dataset
      • 4.2.1 The Popularity Distribution
      • 4.2.2 The Movies Lifetime
      • 4.2.3 Access Evolution Pattern
    • 4.3 Youtube Dataset
      • 4.3.1 Youtube Dataset 2008
      • 4.3.2 Youtube Dataset 2019
  • CHAPTER 5: PREDICTING MODEL
    • 5.1 Derivative-based Multivariate Linear Regression
    • 5.2 Attention-based Non-Recursive Neural Network
  • CHAPTER 6: EXPERIMENT
    • 6.1 Parameter Sensitivity
      • 6.1.1 Derivative-based Multivariate Linear Regression
      • 6.1.2 Attention-based Non-Recursive Neural Network
    • 6.2 Time series prediction
    • 6.3 Predicting online content popularity
      • 6.3.1 Experiments in MovieLens dataset
      • 6.3.2 Experiments in Youtube dataset
      • 6.3.3 Inference time efficiency
  • CHAPTER 7: CONCLUSION & FUTURE WORK

Content

My contributions include providing some useful metrics to extract valuable knowledge from log files and exploiting some superior techniques from the field of natural language processing to address the time series prediction problem.

INTRODUCTION

Overview

The rapid development of Internet technology and infrastructure has caused an explosion of Internet usage. Since all of our activities on the Internet are recorded, an enormous amount of log files is generated every day. Although the sources of log files are diverse, they share a number of characteristics. For example, the amount of data, as well as the information in log files, is huge and grows steadily; data of all types exist in the logs, such as unstructured, semi-structured, and structured data; the data is not always reliable and usually contains noise; and so on. All of these characteristics present both challenges and opportunities for mining and discovering useful information and knowledge from log files.

Specifically, the emergence of social networks has also brought an enormous and ever-growing amount of online content into our digital world. In this context, video content has been identified as a dominant cause of network congestion, as it was projected to account for more than 80% of total Internet traffic by 2020 [1]. It has been revealed that user attention is distributed in a skewed fashion: a few contents receive massive views and downloads, whereas most contents get little user attention [2]. By accurately predicting the future popularity of online content, network operators can proactively manage the distribution as well as the cache replacement policies for online contents across their infrastructures. Service providers can benefit greatly from designing appropriate advertising strategies and recommendation schemes which help their users reach the most relevant and popular contents. Thus, predicting the popularity of online contents, especially videos, is of great importance as it supports and drives the design and management of various services. Several efforts have been made to predict the long-term popularity of online videos by analyzing their historical accesses [2], [3], [4]. However, it has been hard to accurately predict the popularity of a given content in the near future, i.e., to make short-term predictions. In this thesis, I mainly focus on exploiting several mechanisms and techniques from the field of time series prediction to address this problem.

Fundamentally, time series prediction involves fitting models to historical data and applying them to determine future values. Since time series prediction is a basic problem in various domains, it has been applied in wide-ranging and high-impact applications such as financial market prediction [5], weather forecasting [6], predicting the next frame of a given video [7], complex dynamical system analysis [8], and so on. There are many different methods for performing time series prediction, such as the exponential moving average [9], [10], the autoregressive model [11], polynomial regression [12], the autoregressive integrated moving average [13], [14], and so on. Notably, the Recurrent Neural Network (RNN) and its variants have recently outperformed the traditional methods and become the standard for some time series analysis problems. Besides, several studies in Natural Language Processing (NLP) addressing problems such as sentiment analysis and sequence-to-sequence translation have recently produced many superior mechanisms and techniques that can be widely applied in other areas, especially in time series prediction.

Although all of the aforementioned algorithms and techniques can achieve substantial results in time series prediction, applying them to the problem of predicting online content popularity is a fascinating and challenging task.

Hence, the purposes of this thesis are:

• Analyzing the characteristics of some real datasets, including the popularity of online contents on HCMUT's website [15], Youtube [16], MovieLens [17], and so on

• Applying mechanisms and techniques in time series prediction to address the short-term prediction problem as well as proposing appropriate models to predict online content popularity based on the historical accesses

• Evaluating the proposed models on real datasets in comparison with several baseline methods

Contribution

a. Practical contribution:

§ Providing several useful metrics to discover the characteristics of online contents. After pre-processing the raw logs, selecting appropriate metrics is of great importance to understand the characteristics of the dataset as well as to extract valuable knowledge from it. Thereby, we can design suitable predicting models.

§ Predicting online content popularity with reasonable accuracy. Knowing the popularity of online contents, service providers can select appropriate marketing strategies, and advertisers can maximize their revenues through better advertising placement [2]. For network management, network operators can proactively manage bandwidth requirements and effectively deploy their caching systems [18].

§ Providing effective models in time series prediction. As predicting in time series is a basic problem in many domains, the proposed models in this thesis can also be widely applied in many other areas such as financial market prediction [5], weather forecasting [6], predicting the next frame of a given video [7], and complex dynamical system analysis [8].

b. Scientific contribution:

§ Applying several mechanisms from the field of Natural Language Processing (NLP) to time series prediction. Basically, the attention mechanism is a ubiquitous method used in modern deep learning models to tackle a variety of tasks such as language modeling [19], sentiment analysis [19], natural language inference [20], abstractive summarization [21], learning task-independent sentence representations [22], and so on. Here, one of the proposed models combines two attention mechanisms to address time series prediction.

§ Proposing effective models to address the short-term prediction problem in predicting online content popularity. Although the field of predicting online content popularity has been extensively researched in the recent decade, few studies focus on accurately predicting the popularity of a given content in the near future [4], [3]. In this thesis, I apply several state-of-the-art techniques in time series prediction to tackle this problem.

§ Significantly improving inference performance compared to baseline methods. Without using any recurrent or convolutional neural network, the proposed models are highly parallelizable, which dramatically reduces training and testing time. In addition, the empirical results also show that my models can outperform the baselines on several real datasets.

Research scope

§ Generally, this thesis focuses on extracting useful knowledge from log files; it provides several metrics to help us understand the characteristics of the real datasets that I used in the experiments.

§ In particular, I consider some mechanisms and techniques in the fields of time series prediction and natural language processing to address short-term prediction problems, especially predicting online content popularity.

Thesis Outline

My thesis includes seven chapters in total; the rest of it is organized as follows:

Chapter 2 provides the background knowledge related to the techniques and mechanisms used in this thesis. Chapter 3 outlines related research in the fields of log analysis, predicting online content popularity, and time series prediction. Chapter 4 presents some real datasets, explores their characteristics, and explains some analysis results; that chapter also draws out the challenges of time series prediction when applied to these datasets. In Chapter 5, I propose two novel methods and describe their implementation. Chapter 6 discusses the empirical results of my models in comparison to some baselines. Finally, Chapter 7 draws the conclusions and proposes future work.

BACKGROUND

Weblog Mining

Weblog Mining can be considered a sub-domain of data mining, the process of discovering useful patterns or knowledge from data sources such as databases, texts, the web, and so on. Therefore, weblog mining is also carried out in three main steps [23]:

• Pre-processing: a process that removes noise and unnecessary information from the raw data. Moreover, as the raw data is not always suitable for the mining algorithms, it needs to be transformed before being mined. In weblog mining, this process may contain several sub-processes such as data cleaning, pageview identification, user identification, sessionization, path completion, data integration, and so on.

• Data Mining: the process that applies data mining algorithms to produce patterns or knowledge.

• Post-processing: In fact, the discovered patterns are not always useful. This process identifies which ones can be applied in real-world applications.

Fundamentally, this thesis provides some techniques to pre-process weblogs and some machine learning algorithms to predict the popularity of online contents. The results can be readily applied to optimizing caching policies as well as recommendation systems.

Data cleaning is usually site-specific. It involves tasks such as removing references to embedded objects that may not be important for mining purposes; for example, references to style files, graphics, or sound files are usually eliminated. Data cleaning sometimes also involves the removal of meaningless fields of the data which may not provide useful information during the mining tasks.

Similar to data cleaning, identification of pageviews, or so-called access identification, is always site-specific, since it heavily depends on not only the intra-page architecture but also the page contents and some of the underlying site domain knowledge. Basically, a specific user event may provoke several requests to web objects or resources. In other words, pageview identification determines the collection of web objects and resources that represents a user event while accessing contents on the website. For a static website, each HTML file usually corresponds to a pageview. However, most websites are dynamic, and their pageviews may be constructed from static templates coupled with contents generated by the server applications based on a set of parameters. That also explains why pageview identification requires some site domain knowledge.

In fact, WUM applications do not always require knowledge about the user's identity, but it is necessary to distinguish among different users [23]. However, not all sites require user authentication for accessing contents or querying information, which makes it hard to identify the actual user event. Although the IP address alone may not be sufficient for mapping log entries onto the set of unique users, it is still possible to accurately identify unique users by combining the IP address with other information such as user agents and referrers [24], and by applying some other techniques in pageview identification or sessionization.

Sessionization is the process of grouping the activities of each user into sessions in order to identify the user access pattern. In the absence of authentication mechanisms, this process must rely on heuristic methods to sessionize user activities. Here, the goal of a sessionization heuristic is to reconstruct the actual sequence of actions performed by a given user during one visit based on the clickstream data [23].

After pre-processing the log files, some basic statistics can be applied to extract valuable information, for example, grouping user accesses by geographical region (country or province), plotting the trend of user preferences over time, and classifying and analyzing user access patterns. Moreover, analyses based on the gamma distribution and the cumulative distribution also provide insights into the distribution of user attention across different kinds of online contents. All these techniques will be discussed in a later Chapter.

Since the other advanced machine learning techniques come from other fields of study, such as time series prediction and language modeling, the background knowledge related to them is provided in the next sections.

Time Series Prediction

Fundamentally, linear regression is one of the most basic and popular techniques applied in the field of time series prediction. Despite its simplicity, it has achieved significant results in many real-world applications such as financial time series forecasting [25], [26] and predicting content popularity [2], [3]. These models were proposed based on observations of strong linear relations in some real datasets. For instance, in the study of Szabo et al [2], the predicting model was described by the following formula [2]:

ln N_s(t_2) = ln r(t_1, t_2) + ln N_s(t_1) + ξ_s(t_1, t_2)    (1)

where N_s(t) is the popularity of content s at time t, and t_1, t_2 are two arbitrarily chosen points in time, t_1 < t_2. The factor r(t_1, t_2) represents the linear relationship between the log-transformed popularities at t_1 and t_2, and this value is independent of s. Finally, ξ_s is a noise term drawn from a given distribution with mean 0 that describes the randomness observed in the data. Since the model is quite simple, it can only roughly predict long-term popularity.
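Fitting this log-linear model comes down to estimating the offset ln r(t_1, t_2) on log-transformed counts. The sketch below is a minimal illustration of that idea, not the authors' original code; the view counts are invented:

```python
import numpy as np

# Minimal sketch of the log-linear model (eq. 1): estimate ln r(t1, t2) as
# the mean offset between log-popularity at t2 and t1 over training contents,
# then predict N_s(t2) for a new content from its early popularity N_s(t1).
def fit_log_linear(N_t1, N_t2):
    return np.mean(np.log(N_t2) - np.log(N_t1))     # ln r(t1, t2)

def predict_log_linear(ln_r, N_t1_new):
    return np.exp(ln_r + np.log(N_t1_new))          # predicted N_s(t2)

train_t1 = np.array([120, 40, 300, 15, 80])         # hypothetical early counts
train_t2 = np.array([480, 150, 1300, 70, 310])      # hypothetical later counts
ln_r = fit_log_linear(train_t1, train_t2)
print(predict_log_linear(ln_r, np.array([200.0])))  # roughly 200 * r
```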

The later study of Pinto et al [3] provided an enhanced model for predicting video popularity by redefining the popularity N as follows:

N̂(v, t_r, t_t) = Θ_{t_r, t_t} · X_{t_r}(v)    (2)

where v is a given video, Θ_{t_r, t_t} = (θ_1, θ_2, ..., θ_{t_r}) is the set of parameters that the model has to learn, depending only on the reference date t_r and the target date t_t, X_{t_r}(v) = (x_1(v), x_2(v), ..., x_{t_r}(v))^T is the feature vector, and x_i(v) is the number of views received by video v on the i-th day since it was uploaded. The experimental results proved that Pinto's model brought substantial improvement, but it was not robust enough to accurately predict the popularity of online contents in the short term.
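Such a multivariate linear model maps the vector of daily views straight to the future popularity. A minimal sketch with scikit-learn, on made-up data and assuming a 7-day reference window:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sketch of a multivariate linear model in the spirit of eq. (2): features
# are the daily view counts x_1(v)..x_tr(v) up to a reference day tr, and
# the target is the popularity at a later target day tt. Data are invented.
rng = np.random.default_rng(0)
tr = 7
X = rng.poisson(50, size=(200, tr)).astype(float)    # daily views, days 1..7
y = X.sum(axis=1) * rng.uniform(2, 4, size=200)      # popularity at day tt

model = LinearRegression(fit_intercept=False)        # Theta only, as in eq. (2)
model.fit(X, y)
print(model.coef_)                                   # theta_1 .. theta_7
print(model.predict(X[:3]))                          # predictions at day tt
```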

Recently, non-linear models have been widely applied in a variety of applications, one of which is forecasting, especially time series forecasting. The rapid development of deep learning has resulted in many superior predicting models. The idea is: given the previous values of the target series (y_1, y_2, ..., y_{t-1}) with y_i ∈ ℝ, and the current and past values of n other input series (X_1, X_2, ..., X_t) with X_i ∈ ℝ^n, a non-linear model aims to learn the non-linear function F that maps those values to the current target value, ŷ_t = F(y_1, ..., y_{t-1}, X_1, ..., X_t), where ŷ_t represents the predicted value at time step t.

To learn the non-linear function F, many models have been proposed; for example, Hushchyn et al [27] used a simple ANN, Brockwell et al [28] used ARMA, Gao et al [29] used NARX, and so on. Although these models can give reasonable results, there are grounds for optimism: as computational capabilities increase, more complex models become able to handle complicated relationships within datasets.

Language modeling

Although Deep Neural Networks (DNNs) are notably powerful machine learning models that achieve excellent performance on various problems in Natural Language Processing such as speech recognition [30], [31], DNNs can only be effectively applied to problems whose inputs and outputs can be encoded as vectors of fixed dimensionality [32]. In fact, many critical problems are expressed with sequences whose lengths are unknown or arbitrary, for example question answering, machine translation, and so on, which gave birth to sequence-to-sequence models. The first sequence-to-sequence model was introduced in the field of neural machine translation in the study of Sutskever et al [32], which addressed the problem of sequence learning.

Commonly, a sequence-to-sequence model contains two sub-modules: an encoder and a decoder. The encoder encodes the input sequence into a context vector, which is then decoded by the decoder to generate the output. In later studies, the sequence-to-sequence model has also been widely applied in other applications such as speech recognition [33] and text summarization [34], [35], [36]. In addition, some variants of the encoder-decoder architecture have been proposed, which differ in the conditional input and the type of core networks; for example, Bahdanau et al [37] and Luong et al [38] use RNNs.

Currently, long short-term memory networks [39] and gated recurrent units [40] are commonly used as the recurrent networks in encoder-decoder architectures, since both of them allow the model to memorize information from the previous steps and capture long-term dependencies. Besides, some models use convolutional neural networks to cope with large datasets. For instance, Gehring et al [41] introduced the first fully convolutional model for sequence-to-sequence learning, which can outperform recurrent models on large benchmark datasets.

In an encoder-decoder model using an RNN, the encoder sequentially encodes the input sequence x = (x_1, x_2, ..., x_T) into fixed-size vectors h = (h_1, h_2, ..., h_T), also known as hidden states, with h_t = f(h_{t-1}, x_t). The last hidden state is used as the context vector, which is decoded in the same manner to generate the output. Here, the attention mechanism allows the model to create shortcuts between the entire input sequence, the output, and the context vector. The weights of the shortcut connections are customizable for each output element. As it can achieve significant results in the field of language modeling, several variants of the attention mechanism have been proposed to address specific problems. These mechanisms include content-based attention [42], additive (also called "concat") attention [37], location attention [38], scaled dot-product attention [43], and so on.

Below is a brief summary of several popular attention mechanisms and their corresponding alignment score functions:

• Content-based attention [42]: score(s_t, h_i) = cosine(s_t, h_i)
• Additive attention [37]: score(s_t, h_i) = v_a^T tanh(W_a [s_t; h_i])
• Location-based attention [38]: a_{t,i} = softmax(W_a s_t)
• Dot-product attention [38]: score(s_t, h_i) = s_t^T h_i
• Scaled dot-product attention [43]: score(s_t, h_i) = s_t^T h_i / √n    (7)

where s_t and h_i are the cell state and the hidden state of the RNN model at time step t, n is the dimension of the two vectors s_t and h_i, and v_a and W_a are parameters that the model has to learn.

Recently, Cheng et al [19] proposed self-attention (also known as intra-attention), a mechanism that performs shallow reasoning with memory and attention. In contrast to inter-attention, self-attention requires the model to compute the attention scores of different positions within a single sequence. In fact, it has been successfully applied in a variety of tasks including language modeling [19], sentiment analysis [19], natural language inference [19], [20], abstractive summarization [21], learning task-independent sentence representations [22], and so on.

One of the limitations of models belonging to the RNN family is that they compute each time step sequentially, which leads to long training and inference times. Much effort has been made to address this issue, such as replacing the RNNs with very deep CNNs to capture the long-term dependencies. For instance, Gehring et al [41] investigated convolutional layers for sequence-to-sequence tasks, Zeng et al [44] exploited a convolutional deep neural network to extract lexical and sentence-level features, and Conneau et al [45] applied very deep convolutional nets to text processing.

Unlike those approaches, in a recent study, Vaswani et al [43] proposed a novel model called the Transformer. Like common sequence-to-sequence models, the Transformer's architecture is built on the encoder-decoder structure. However, its encoder is composed of N stacked identical layers, each with two sub-layers: a self-attention layer and a fully connected feed-forward layer. In the same manner, the decoder consists of M stacked identical layers, but its elemental layer has three sub-layers: in addition to the two sub-layers of the encoder, the decoder has an encoder-decoder attention layer to perform attention over the source sequence representation. By entirely eliminating recurrent and convolutional connections and applying the self-attention mechanism to capture long-term dependencies, the Transformer has been proven to reach a new state of the art in translation quality.
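The core of the Transformer is easy to state in a few lines. The sketch below shows single-head scaled dot-product self-attention on toy data; the dimensions and weights are invented, and a real Transformer adds multiple heads, masking, and layer normalization:

```python
import numpy as np

# Minimal sketch of scaled dot-product self-attention as used in the
# Transformer (Vaswani et al. [43]); single head, no masking, toy sizes.
def self_attention(X, Wq, Wk, Wv):
    """X: (T, d) input sequence; Wq/Wk/Wv: (d, d) learned projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (T, T) alignment scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                              # (T, d) attended output

rng = np.random.default_rng(0)
T, d = 5, 8                        # sequence length and model dimension
X = rng.normal(size=(T, d))
out = self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)                   # (5, 8)
```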

In short, this Chapter has briefly summarized some background knowledge related to weblog mining, time series prediction, and several advanced techniques in the field of language modeling. To give a better understanding of how they can be applied in this work, the next Chapter provides some related works.

RELATED WORK

Weblog Mining

In recent decades, weblog mining has been an active research area which has attracted extensive community attention. As mentioned in Liu's study [23], weblog mining can be categorized into three sub-domains: web content mining (WCM), web structure mining (WSM), and web usage mining (WUM). Since different websites have different weblog structures, and most of the studies mainly focus on analyzing particular log files and proposing appropriate approaches, there are few broadly applicable works. Here, I only present some typical research related to my thesis.

In terms of web content mining, many efforts have been made to analyze user access patterns from weblogs. For instance, in Spiliopoulou's work [47], [48], a mining system called Web Utilization Miner was used to extract interesting navigation patterns from weblogs. At the time, it was proven to satisfy the experts' criteria by exploiting an innovative aggregated storage representation for the information in the logs of a real web server. The authors also proposed their own mining language (called MINT), which supports specifications of a statistical, structural, and textual nature, to build the system. However, as web contents and weblog structures gradually change, those techniques are no longer suitable for analyzing weblogs nowadays. That is also the reason why the authors emphasized the importance of data preparation, or so-called data pre-processing, in weblog mining.

Recently, Alfaro et al [49] combined supervised machine learning algorithms and unsupervised learning techniques for sentiment analysis and opinion mining. They proposed a multi-stage method for the automatic detection of different opinion trends based on analyzing weblogs.

In the scope of web structure mining, many algorithms and techniques have been proposed, such as frequent pattern growth (FP-growth) [50] and association rule mining (ARM) [51], to extract valuable information about the user from weblogs. In a study by Iváncsy et al [52], some FP mining techniques were proposed to explore different types of patterns in weblogs. By giving information on problems arising for users, mining frequent patterns from weblogs is of great importance for optimizing the web structure of a website and improving the performance of the whole system. For example, Perkowitz et al [53] investigated the problem of index page synthesis. In order to create an adaptive website, the authors proposed mining clusters to find collections of cohesive pages and then gathering them into the same group by applying their PageGather algorithm. Based on that, the system was able to automatically generate index pages that facilitate the visitors' navigation within the website.

In a study by Wang et al [54], an enhanced algorithm called weighted association rules (WAR) was proposed. It assigns a numerical attribute to each item and judges those weights in a particular domain. A later study [55], inspired by the same idea as WAR, addressed the problem of discovering binary relationships in transaction datasets in weighted settings, which was proven to be more scalable and efficient.

However, it has been shown that weblogs, in general, are sparse and contain patterns of arbitrary length, which makes it difficult for a conventional algorithm to mine the user access patterns. Therefore, Sun et al [56] presented an algorithm named Combined Frequent Pattern Mining (CFPM) to address the problem. As the algorithm is a combination of the candidate-generation-and-test approach [57] and the pattern-growth approach [58], it can adaptively mine both long and short frequent patterns.

In the context of web usage mining, there are many studies focusing not only on user accesses but also on the usage patterns of web pages. Basically, they can be categorized into two sub-domains: predictive and descriptive [59], [60].

In the descriptive domain, data is classified or characterized to extract useful knowledge. For example, Zhang et al [61] proposed the Self Organizing Map (SOM) model, or so-called Kohonen neural network, to discover user groups in real time based on the users' sessions extracted from weblogs. As a result, they can effectively recommend suitable web links or products that the users in each group may be interested in. In another study, Das et al [62] extracted the user access pattern by using a model called the Path Analysis Model. Specifically, the model counts the number of times that a link appears in the log and then applies some association rules to understand the user navigation. Based on that, they could improve the attractiveness of the website.

In the predictive domain, many efforts have been made to address the problem of predicting user behavior. Recently, Neelima et al [63] attempted to analyze user behavior based on the amount of time users spend on a particular page. The user sessions extracted from weblogs are also used for prediction purposes. In a study by Wang et al [64], an Unsupervised Clickstream Clustering model was proposed to capture dominant user behaviors from clickstream data. Unexpectedly, the authors found that the model was also able to predict future user dormancy. In addition, some studies related to predicting the popularity of online contents will be discussed in a later section.

Time Series Prediction

In the last decade, time series prediction algorithms have been extensively researched and applied to solve many critical problems across various areas, for instance financial market prediction [5], weather forecasting [6], predicting the next frame of a given video [7], complex dynamical system analysis [8], and so on.

Basically, time series prediction is the use of a model to predict future values based on previously observed values. In other words, prediction involves fitting models to historical data and then applying them to forecast the future values of the input sequences. Time series prediction can be considered a basic problem in many domains, with wide-ranging and high-impact applications. Ever since the well-known autoregressive moving average (ARMA) model was first proposed, the model and its variants [65], [28] have been proven effective in various real-world applications. However, these models are not able to model non-linear relationships or differentiate among input sequences. The nonlinear autoregressive exogenous (NARX) approach [66] was then proposed to address this problem. Over time, to make the approach more flexible, many improvements have been made: Gao et al [29] proposed a nonlinear autoregressive moving average with exogenous inputs (NARMAX) model to improve predictive performance using a fuzzy neural network; Diaconescu et al [67] exploited NARX to make predictions on chaotic time series; and so on. Here, the basic idea is still to utilize a neural network to learn the non-linear relationships mapping the previous values of the input sequences to the target sequences.

Although many other efforts have been made to address the time series prediction problem, such as kernel methods [68], ensemble methods [69], Gaussian processes [70], and so on, most of these approaches employ a predefined non-linear form. Thus, they may not appropriately capture the actual non-linear relationships among the input series [71]. The development of deep learning has resulted in many superior neural network models, including Recurrent Neural Networks (RNNs), a type of deep neural network successfully applied in sequence modeling. RNNs have been extensively researched and have received a great amount of attention, as they are very flexible in capturing non-linear relationships.

However, traditional RNNs usually suffer from the vanishing gradient problem [72], which makes it hard for the models to capture long-term dependencies. To partially overcome this drawback, the Long Short-Term Memory unit, also known as LSTM, was proposed in Hochreiter's study [39], and the gated recurrent unit (GRU) was proposed in Cho's study [40]. They achieved substantial success in various applications in the field of neural machine translation (NMT). Recently, Qin et al [71] successfully applied the advances of LSTM as well as the attention mechanism to the time series prediction problem. Although it achieved a new state of the art in time series prediction, the Dual-stage Attention-based Recurrent Neural Network (DA-RNN) relies heavily on the LSTM, which entails enormous amounts of recursive computation.

Predicting content popularity

The field of predicting online content popularity was pioneered by the initiative of Szabo et al [2]. The study showed evidence of a strong linear correlation, on the logarithmic scale, between the long-term popularity and the early popularity. Based on that, the authors proposed a simple log-linear model to predict the overall popularity of a given online content from early observations. The proposed model was evaluated on various datasets, including Youtube videos [16], Digg stories [2], and so on. Inspired by this idea, Pinto et al [3] provided two enhanced models, called the Multivariate Linear model and the MRBF model. Using daily samples of content popularity measured up to a given reference date, these models are able to make predictions with reasonable accuracy on the Youtube dataset [16].

In recent research, Li et al [4] introduced a novel model that is able to capture the popularity dynamics based on the early popularity evolution pattern as well as popularity burst detection. In this work, the authors consider not only some basic early popularity measurements but also the characteristics of individual videos and the popularity evolution patterns as the inputs of their model. In addition to the regression-based methods, some other techniques such as reservoir computing [73] and time series analysis [74] have also been applied to improve performance.

Despite achieving initial results, the aforementioned studies mainly focused on predicting the long-term popularity of a given content. To address the problem in both the long term and the short term, the recent works of Hushchyn [27] and Meoni [75] proposed some simple artificial neural networks (ANNs) to predict the popularity of scientific datasets at the Large Hadron Collider at CERN. Since these models are not robust enough to accurately predict the popularity of a particular item, they are mainly used for classification purposes. Hence, accurately predicting the popularity of online contents in the near future remains a non-trivial task.

In summary, this Chapter has provided an overview of some studies on extracting knowledge from log files as well as predicting the popularity of contents. As those studies provide an outline of, as well as insights into, the research domains, they can be readily applied to address the problems proposed in this thesis. Based on that, the next Chapter will discuss the processing and analysis of some real datasets, namely the weblog of the HCMUT website, MovieLens, and Youtube.

LOGFILE PROCESSING & ANALYSIS

Weblog of HCMUT Website

Basically, weblogs are categorized into three groups: client log files, proxy log files, and server log files. The HCMUT log file is a server log recording all sites users accessed while interacting with the HCMUT website from November 2016 to the end of October 2017. To extract useful knowledge from this data, I apply a web usage mining process which consists of data pre-processing and some pattern analysis. In this dataset, I consider the number of accesses as the popularity of a particular content.

Data pre-processing is an essential task in every data mining application, as it takes responsibility for generating data in a suitable format where statistics and mining algorithms can be applied. In fact, log data always contains a lot of meaningless information and noise, so data pre-processing may occupy about 80% of the mining process, according to Z. Pabarskaite [77]. Moreover, as weblogs are collected from multiple sources through multiple channels, they usually have inconsistent formats.

Table 4 1 Structure of HCMUT weblog (ECLF)

IP address: The IP address of the client (remote host) which made the request to the server.

Client identity: A "hyphen" in the output indicates that the requested piece of information is not available. In this case, the unavailable information is the RFC 1413 identity of the client, as determined by identd on the client's machine.

UserID: The userID of the person requesting the document, as determined by HTTP authentication.

Time: The time at which the server finished processing the request.

Request line, e.g. "GET /apache_pb.gif HTTP/1.0": The request line from the client, given in double quotes. It contains a great deal of useful information: first, the method used by the client (GET); second, the requested resource (/apache_pb.gif); and third, the protocol the client used (HTTP/1.0).

Status code: The status code that the server sends back to the client.

Object size: The size of the object returned to the client, not including the response headers.

Referrer, e.g. "http://www.example.com/start.html": The "Referer" (sic) HTTP request header. This gives the site from which the client reports having been referred.

User-Agent: The User-Agent HTTP request header. This is the identifying information that the client browser reports about itself.

Since the HCMUT weblogs are collected from a single server, all the records are stored following the Extended Common Log Format (ECLF), a semi-structured format. The structure of each record is described in Table 4 1 [78].

§ Data cleaning:

As the first step of data pre-processing, data cleaning reduces the size of the dataset significantly. Fundamentally, in this step, all the meaningless records are removed from the data. Table 4 2 [78] provides some examples of eliminations; all the red records are removed from the dataset. Specifically, requests referring to files such as style files and images may not provide useful knowledge in my case study, hence they are all eliminated. Moreover, failed requests whose response status is an error are also removed. Experimental measurement has shown that the cleaning process reduced the size of the data by about 46%, which significantly lowers the computation cost of the next steps. A minimal sketch of such a cleaning pass appears after Table 4 2.

Table 4 2 Example of data cleaning

Line  Requested resource                                  Referrer
1     www.hcmut.edu.vn/                                   www.google.com
2     www.hcmut.edu.vn/vi/                                www.hcmut.edu.vn/
3     /includes/css/hcmut/welcome.css                     www.hcmut.edu.vn/vi/
4     /includes/css/hcmut/images/cell_a_bg.png            www.hcmut.edu.vn/includes/css/hcmut/welcome.css
5     /includes/css/hcmut/images/tintuc/top_menu_bg.png   www.hcmut.edu.vn/includes/css/hcmut/welcome.css
6     /includes/css/hcmut/nivo_slider.css                 www.hcmut.edu.vn/vi/
7     /vi/newsletter/view/su-kien                         www.hcmut.edu.vn/vi/newsletter/
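The filtering rules above translate into a short pass over the parsed log. The following is a minimal sketch, assuming records have already been parsed into dicts with "request" and "status" fields; the field names and the extension list are illustrative choices, not the thesis's:

```python
import re

# Sketch of the data-cleaning step: drop requests for embedded objects
# (style files, images, sounds) and failed requests (4xx/5xx statuses).
STATIC_EXT = re.compile(r"\.(css|js|png|gif|jpg|jpeg|ico|mp3|wav)(\?|$)", re.I)

def clean(records):
    kept = []
    for rec in records:
        if STATIC_EXT.search(rec["request"]):
            continue                  # embedded object, not a real pageview
        if rec["status"] >= 400:
            continue                  # failed request ("error" status)
        kept.append(rec)
    return kept

logs = [
    {"request": "/vi/newsletter/view/su-kien", "status": 200},
    {"request": "/includes/css/hcmut/welcome.css", "status": 200},
    {"request": "/vi/missing-page", "status": 404},
]
print(clean(logs))                    # only the first record survives
```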

Similar to data cleaning, this step can substantially reduce the size of the dataset. Since I mainly consider the popularity of a given content, in other words its number of accesses, I only preserve the attributes that may be of interest in the mining process, for example IP address, Date Time, and Referrer. As most of the contents on the HCMUT website are public, most of the requests referring to these contents do not contain any authentication information. Relying only on the IP address is not enough to determine the actual user access, as each access may produce many lines in the weblog. Hence, attributes like Date Time and Referrer are of great importance for further investigating the access patterns.

§ Access Identification:

Basically, each pageview on a website is always a collection of web objects and resources. In fact, a pageview usually represents a specific user event, such as clicking on a link, opening an options panel, or accessing content. That is why each access often corresponds to several lines in the log files. Thus, the access identification process is applied to the weblog to determine the actual user accesses. In addition, each request to an object or resource is represented in the Uniform Resource Identifier (URI) format, which is described in Fig 1 [78].

Fig 1 The components of a URI: Scheme, Authority, Path, Query, Fragment (e.g. http://www.hcmut.edu.vn/vi/newsletter/view/su-kien/?tagartment&offices=newest#top)

By eliminating the query and fragment components, I obtain the dataset shown in Table 4 3 [78]. Again, I remove all red records, as they belong to an already listed access. In particular, all requests are grouped by IP address and sorted by Date Time. Within each group, a request r is considered a new access A_i if it satisfies one of the two following conditions: (1) the Referrer of r is different from the Referrer of the previous access A_{i-1}; or (2) t_1 - t_2 > θ, where t_1 is the timestamp of r, t_2 is the timestamp of A_{i-1}, and θ is a threshold determined by experiment. Otherwise, r is removed from the dataset. A minimal sketch of this grouping follows.
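The two rules amount to a single pass over the sorted requests. The record layout and the θ value below are illustrative assumptions only:

```python
from datetime import timedelta

# Sketch of access identification: within each IP group (sorted by time),
# a request starts a new access A_i when its Referrer changes or when it
# arrives more than `theta` after the previous access; otherwise it is a
# duplicate line of the current access and is dropped.
def identify_accesses(requests, theta=timedelta(seconds=30)):
    accesses = {}                                  # ip -> list of accesses
    for r in sorted(requests, key=lambda r: (r["ip"], r["time"])):
        group = accesses.setdefault(r["ip"], [])
        last = group[-1] if group else None
        if (last is None
                or r["referrer"] != last["referrer"]
                or r["time"] - last["time"] > theta):
            group.append(r)                        # new access A_i
    return accesses
```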

§ Data Transformation:

Fundamentally, data transformation is the process that converts data from one structure into other structures where the mining algorithms can be applied. Since the weblog of the HCMUT website includes the IP addresses of the users, it is simple to obtain more information about the geographic locations of the users accessing the website. Toward this goal, I apply the Client IP Location Lookup model [78] to produce three more attributes: country, city, and Internet Service Provider (ISP). The details of the model are described in Fig 2.

Table 4 3 Example of Access Identification

Line  IP Address      Date Time                     Referrer
1     112.197.177.55  [06/Nov/2016:19:50:31 +0700]  http://www.hcmut.edu.vn/vi
2     112.197.177.55  [06/Nov/2016:19:50:31 +0700]  http://www.hcmut.edu.vn/vi
3     112.197.177.55  [06/Nov/2016:19:50:32 +0700]  http://www.hcmut.edu.vn/vi
4     112.197.177.55  [06/Nov/2016:19:50:32 +0700]  http://www.hcmut.edu.vn/vi/newsletter/view/su-kien
5     172.28.2.3      [06/Nov/2016:19:50:32 +0700]  http://www.hcmut.edu.vn/vi
6     172.28.2.3      [06/Nov/2016:19:50:32 +0700]  http://www.hcmut.edu.vn/vi
7     112.197.177.55  [06/Nov/2016:19:50:32 +0700]  http://www.hcmut.edu.vn/vi
…
20    112.197.177.55  [06/Nov/2016:19:52:04 +0700]  http://www.hcmut.edu.vn/vi

Initially, all client IPs are considered unknown IPs. They are looked up one by one by means of several public APIs, such as https://extreme-ip-lookup.com/json/ and http://ip-api.com/json/. The IP addresses found are then saved into an offline database for later use. Based on that, I not only save the time of looking up massive numbers of duplicated IP addresses but also attain an extensive database of IPs and locations. Besides, some private IPs cannot be looked up; they are stored in a separate database to be updated manually.
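The lookup-with-cache flow can be sketched as below; the code assumes the ip-api.com endpoint named in the text, and a plain dict stands in for the offline database:

```python
import ipaddress
import json
import urllib.request

# Sketch of the Client IP Location Lookup model: consult the offline cache
# first, set private IPs aside for manual handling, and query the public
# API only for unknown public IPs, growing the offline database as we go.
offline_db, private_ips = {}, set()

def lookup(ip):
    if ip in offline_db:
        return offline_db[ip]                # cache hit: no API call needed
    if ipaddress.ip_address(ip).is_private:
        private_ips.add(ip)                  # e.g. 172.28.2.3
        return None
    with urllib.request.urlopen(f"http://ip-api.com/json/{ip}") as resp:
        info = json.load(resp)
    record = {"country": info.get("country"),
              "city": info.get("city"),
              "isp": info.get("isp")}
    offline_db[ip] = record
    return record
```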

Fig 2 Client IP Lookup Model (client IPs are resolved against the offline database, with unknown public IPs sent to an online IP Location Lookup API; the transformed records carry Date Time, Referrer, Country, City, and ISP)

Since the data obtained from the data transformation process are derived from many different sources, they are often unreliable and have inconsistent formats. To address this problem, data normalization is commonly applied as a supplementary process in the data pre-processing task. Table 4 4, Table 4 5 and Table 4 6 show several examples of normalizing data [78]; for instance, country names such as "Islamic Republic of Iran" are normalized to "Iran", and ISP names are normalized as follows:

Raw ISP string                              Normalized
Vietnam Post and Telecom Corporation        VNPT-VN
Vietnam Posts and Telecommunications Group  VNPT-VN
Vietnamnet-No 4 Lang Ha Ha Noi              VIETNAMNET-VN

In short, the whole data pre-processing process is summarized in Fig 3. As a result, the processed dataset contains more than 74 million records, which represent about 10 million accesses to the HCMUT website over a year. Moreover, I also attain more than 780 thousand different IP addresses and their locations, stored in the offline database.

Fig 3 Data Pre-processing process

MovieLens Dataset

The MovieLens datasets have been famous since they were first released in 1998, and they are widely used in research, education, and industry. Basically, they provide various datasets presenting people's preferences for movies from 1998 up to now [17]. In this thesis, I only use the MovieLens 20M Dataset, which is their most stable benchmark dataset. It contains about 20 million rating records and 465,000 tag applications applied to 27,000 movies by 138,000 users. Specifically, each record takes the form of a (userId, movieId, rating, timestamp) tuple representing a user's preference at a particular time.

Since this dataset contains the rating data of movies, I consider the number of ratings as the popularity of each movie.

Fundamentally, the whole dataset has been processed and stored in the CSV file format, a structured data type. Since the dataset has been extensively researched for decades, in this section I only provide some analyses to extract useful knowledge for the later experiments.

Here, Fig 11 plots the distribution of the movies' popularity within the whole dataset. It can be observed that about 90% of the contents in this dataset have fewer than 1000 accesses. As the MovieLens data has been collected over a long time, some old movies get only a few ratings, i.e., a few accesses. The metric in Fig 11 also shows that about 15% of the contents get only one access.

Fig 11 The proportion of content popularity in MovieLens dataset

As mentioned in X. Cheng's study [79], the popularity distribution of video contents is commonly represented by a gamma distribution [80], but the gamma distribution varies with the values of its parameters α and β. In this case, the overall popularity of the movies in the MovieLens dataset fits a gamma distribution very well, with α and β equal to 6.56e-4 and 4.25 respectively.

For a better understanding, I also investigate the popularity distribution of those movies at specific timestamps. The obtained results show that the distribution also fits the gamma distribution well. However, the parameters of the gamma fit at each timestamp differ vastly from those at the others. An example is shown in Fig 12, where α is 0.165 and β is 1.631.
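Fitting such a gamma curve is a short exercise with SciPy. The sketch below is illustrative only: the counts are invented, and pinning the location parameter at zero (floc=0) and reading β as a rate are my assumptions about the parametrization, which the thesis does not state:

```python
import numpy as np
from scipy import stats

# Sketch of fitting a gamma distribution to per-movie popularity counts.
rng = np.random.default_rng(0)
counts = rng.gamma(shape=0.5, scale=800, size=10_000)  # stand-in for ratings

alpha, loc, scale = stats.gamma.fit(counts, floc=0)    # fix location at 0
beta = 1 / scale                                       # if beta denotes a rate
print(alpha, beta)
```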

Fig 12 Movies popularity distribution at a specific timestamp

Fig 13 The CDF of movies popularity in MovieLens dataset

In Fig 13, I consider the cumulative distribution function (CDF) of the popularity within the whole dataset. The total popularity of all contents is about 20 million accesses, since I am using the MovieLens 20M Dataset. However, it can be observed in Fig 13 that 20% of the movies account for fewer than 10 thousand accesses, about 0.05% of the total accesses. The cumulative accesses of 80% of the movies amount to only one million, 5% of the total accesses.


Fig 14 Movies' lifetime in MovieLens dataset

From the above observations, it can be concluded that the users' preferences within the MovieLens dataset are also allocated in a skewed fashion, since most of the movies get few accesses while a few others receive massive user attention.

By grouping the number of accesses of each movie by day, I can also estimate the lifetime of each individual movie in the MovieLens dataset. Specifically, I define the lifetime of a given movie as the sum of the periods of time in which this movie is attractive to users, in other words, in which it receives sufficient accesses.

If the movie m does not get enough daily accesses on day k, then day k is not counted in the lifetime of m. The adequate daily accesses Ada_m of the movie m are defined by the following formula:

Ada_m = min(θ, λ · N_m / n)    (8)

where n is the number of days in the observation period, N_m is the total number of accesses that the movie m receives in the n days, and θ and λ are two thresholds indicating the minimum values of daily accesses in absolute and relative terms. In this analysis, I set θ = 5 and λ = 1.5; that means the adequate number of accesses of each movie is the minimum of 5 and 1.5 times its average number of accesses within the n days. It can be seen in Fig 14 that about 80% of the movies have a very short lifetime, from 1 to 10 days. Nearly 90% of the movies have a lifetime shorter than 250 days, whereas the longest lifetime is about 1340 days.
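Equation (8) translates into a short pandas pass. A minimal sketch, assuming a DataFrame of daily access counts per movie (the column names are my own):

```python
import pandas as pd

# Sketch of the lifetime metric (eq. 8): a day counts toward a movie's
# lifetime only if its accesses reach Ada_m = min(theta, lam * N_m / n).
def lifetimes(daily, theta=5, lam=1.5):
    """daily: DataFrame with columns movie_id, day, accesses."""
    out = {}
    for movie, g in daily.groupby("movie_id"):
        n = g["day"].nunique()                          # observation days
        ada = min(theta, lam * g["accesses"].sum() / n)
        out[movie] = int((g["accesses"] >= ada).sum())  # days in the lifetime
    return out

daily = pd.DataFrame({"movie_id": [1, 1, 1, 2, 2],
                      "day":      [1, 2, 3, 1, 2],
                      "accesses": [9, 1, 7, 2, 2]})
print(lifetimes(daily))                                 # {1: 2, 2: 0}
```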

Since most of the movies in the MovieLens dataset have a very short lifetime, further investigation should focus on those that have a long lifetime. That would significantly reduce the size of the dataset as well as the redundant computations.

To effectively predict the popularity of the movies, understanding the user access evolution pattern can be considered one of the essential steps. According to C. Li [4], the temporal growth of video popularity is complicated: for some videos, the popularity may increase rapidly within some short periods, also known as bursts, while for some other videos the popularity increases steadily. Moreover, it is common for a video to have more than one burst period. Hence, I use the number of burst periods to define the popularity evolution patterns, which describe the differences in the movies' popularity evolution trends.

In detail, I denote by I_m^k the number of accesses of movie m on the k-th day, and by N_m^k the cumulative accesses of movie m up to the k-th day. The burst periods are then detected by the following formula:

s_m^k = 1 if I_m^k > δ · N_m^(k-1) / (k - 1), and s_m^k = 0 otherwise    (10)

where s_m^k is the growth state of movie m on the k-th day and δ is a threshold; here, I set δ = 3. Then, s_m^k equal to zero means the popularity of movie m on the k-th day increases steadily. Otherwise, the k-th day is considered part of a burst period.

After applying this formula to the whole dataset, the obtained results are the sequences of the growth states of all movies. By counting the number of state changes, I can count the number of burst periods for each movie within the dataset. Table 4 7 shows the proportions of movies having different numbers of burst periods within the top 5000 most popular movies in the MovieLens dataset; a small sketch of the burst counting follows the table.

Table 4 7 The proportion of different evolution patterns
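A minimal sketch of the burst counting, assuming the burst rule of equation (10), i.e., a day is bursty when its accesses exceed δ times the average daily accesses observed so far:

```python
import numpy as np

# Sketch of burst detection and counting: compute the growth state s_m^k
# per day, then count burst periods as 0 -> 1 transitions of the state.
def burst_periods(daily_accesses, delta=3):
    I = np.asarray(daily_accesses, dtype=float)       # I_m^k
    N = np.cumsum(I)                                  # N_m^k
    s = np.zeros_like(I)                              # growth states s_m^k
    for k in range(1, len(I)):
        s[k] = 1.0 if I[k] > delta * N[k - 1] / k else 0.0
    return int(np.sum((s[1:] == 1) & (s[:-1] == 0)))  # rising edges

print(burst_periods([2, 3, 2, 40, 35, 3, 2, 50]))     # -> 2 burst periods
```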

Youtube Dataset

Basically, Youtube is one of the biggest video hosting services. As of 2018, Youtube had more than 30 million active users, and about 5 billion videos were watched on a daily basis. In fact, Youtube is ranked as the second most popular site in the world, according to Alexa Internet [81]. In this thesis, I use two datasets collected from Youtube via its application programming interface [16].

The first one was collected by X. Cheng et al [76] in 2008. The authors considered all the Youtube videos as a directed graph, with each video a node in this graph. A directed edge from node a to node b indicates that video b is listed in the top 20 most related videos of video a. The authors then applied a breadth-first search to find all videos in the graph. Given the video IDs, they collected the statistics of these videos periodically, almost every 2 days, over two periods: from February 22nd, 2007 to May 18th, 2007, and from March 27th, 2008 to July 27th, 2008. Since the dataset collected in 2008 is more exhaustive and reliable, according to the authors, I only use this dataset in my experiments. The dataset contains 59 data points for more than 5 million different videos. The format of the data is described in Table 4 8.

With the Youtube dataset, the view count of a given video represents its popularity. Here, Fig 15 shows the proportions of videos coupled with their popularity on March 27th, 2008. As Youtube is one of the most popular sites in the world, about 50% of the collected videos have view counts of more than 10 thousand.

Although the distribution line in Fig 15 is different from the one representing the popularity of movies in Fig 11, it can also be well fit by a gamma distribution. Applying the gamma fit to all data collection timestamps, I obtain a wide range of α and β values.

Besides, the cumulative views are also considered, to show how user attention is distributed within the graph of related videos. Fig 16 depicts the distribution on March 27th, 2008. With more than 37 billion views in total, 90% of these views are observed on only 18% of the videos, while the approximately 63% of unpopular videos account for only one billion views.

Table 4 8 The format of the Youtube dataset (2008)

Video ID: An 11-digit string (unique for each video)
Owner: A string representing the username of the video's owner
Age: The number of days between the date the video was uploaded and Youtube's establishment date
Category: Video category (chosen by the owner)
Length: Length of the video (in seconds)
Views: The current number of views
Rating: The current average rating point
Comments: The current number of comments
Related IDs: List of related video IDs (up to 20 IDs)

Fig 15 The proportion of videos popularity in Youtube dataset

Fig 16 The CDF of views in Youtube dataset

In a recent study, Nakajima et al [82] used the gamma distribution as the popularity distribution of video contents in a Telco-CDN. Based on that, they generated user requests to evaluate their caching algorithms. However, in their experiment, they fixed the parameters of the gamma distribution for all days. In fact, those parameters are diverse, and they differ between days. Fig 17 and Fig 18 show the α and β observed at the 59 data collection points. Specifically, α appears in a range from 1e-8 to 1e4 and does not seem to follow any regularity. Showing up in another pattern, β has an even broader range, from 1e-22 to 1e8.

Although this dataset contains information on more than 5 million videos, almost no videos are recorded over 10 or more data collection points, as the videos were selected randomly. Thus, this dataset is suitable for analyzing the popularity of videos on Youtube, but it is almost impossible to use for discovering video access patterns or making predictions.

Fig 17 The Alpha parameter of gamma distribution over the observation

Fig 18 The Beta parameter of gamma distribution over the observation

Aiming at analyzing the access pattern as well as predicting the popularity of a given video in the near future, I used the same crawler [16] to collect data more frequently over several 10-day periods: February 21 to March 2, 2019; March 2 to March 12, 2019; and March 12 to March 22, 2019. Due to Youtube's policy limits, each dataset contains hourly records of only 50 videos per country. These videos are the most popular contents at the beginning of the collecting process within 20 countries, including the US, Japan, Russia, China, India, and so on, where most of the visitors of the site are located [81]. In summary, my dataset contains the view counts of more than 2500 different videos. The average number of views for those 50 videos in each country is about 430 million views per hour. However, since the view counts of some videos do not appear to be updated more often than once an hour, the experiments are only conducted on subsets whose view counts are regularly updated. The format of this dataset is described in Table 4 9.

Table 4 9 The format of the Youtube dataset (2019)

Video ID: An 11-digit string (unique for each video)
Channel ID: A unique digit string representing the channel releasing the video
Timestamp: The time at which the information is recorded
Category: Video category (chosen by the owner)
Duration: Length of the video (in seconds)
Views: The current number of views
Likes: The current number of likes
Dislikes: The current number of dislikes
Comments: The current number of comments
Published At: The date that the video was published

Considering the hourly view counts of the top 50 popular videos in several countries, Fig 19 shows that Japan is the country whose videos attract much higher user attention than the others. Specifically, the top 50 videos in Japan got more than 1380 million views per hour during the observation, whereas the top 50 videos in the US, China, Korea and Vietnam only received 734, 693, 612 and 168 million views per hour respectively. Notably, the top 50 Vietnamese videos achieved a higher view count than the Indian videos, although the number of users in Vietnam is much lower than in India [81]. One explanation could be that the videos in one country may attract a lot of user attention from the others. That is also the reason why Japanese videos received such a high number of views.

Fig 19 Hourly views count of top 50 popular videos in several countries

Fig 20 Percentage of views in a given hour of a day in Japan

Besides, the active times of users also differ in each nation. By calculating the rate of views in a given hour of the day, Fig 20 shows that Japanese users are active at almost every time of the day, with only a slight difference between daytime and nighttime. Meanwhile, users in the US seem to be more active in the daytime, which is depicted in Fig 21.

Fig 21 Percentage of views in a given hour of a day in the US

In contrast, users in Vietnam, Singapore, and some other Asian countries tend to be more active during the nighttime, especially from 9 pm to 4 am the next morning. Here, Fig 22 presents the distribution of user accesses within a day in Vietnam and Singapore.

Fig 22 Percentage of views in a given hour of a day in Vietnam and Singapore
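The hour-of-day shares behind Fig 20 through Fig 22 amount to a groupby over the hourly view deltas. A minimal sketch, with invented column names and data:

```python
import pandas as pd

# Sketch of the hour-of-day analysis: turn cumulative hourly view counts
# into per-hour gains, then express each hour's share of the total views.
def views_share_by_hour(df):
    """df: DataFrame with columns video_id, timestamp (hourly), views."""
    df = df.sort_values("timestamp").copy()
    df["gain"] = df.groupby("video_id")["views"].diff().clip(lower=0)
    by_hour = df.groupby(df["timestamp"].dt.hour)["gain"].sum()
    return 100 * by_hour / by_hour.sum()     # percentage for hours 0..23

df = pd.DataFrame({"video_id": [1, 1, 1],
                   "timestamp": pd.to_datetime(["2019-02-21 00:00",
                                                "2019-02-21 01:00",
                                                "2019-02-21 02:00"]),
                   "views": [100, 160, 190]})
print(views_share_by_hour(df))               # hour 1: ~66.7%, hour 2: ~33.3%
```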

Moreover, with this dataset, I also investigate how user accesses are distributed among the days of a week. As described in Fig 23, the numbers of views that the most popular videos in both the US and Japan received are higher on weekdays than at the weekend, especially on Tuesday and Wednesday. This fact is also observed in the data collected from other countries.

Fig 23 Percentage of view in a given day of a week in the US and Japan

In summary, the Youtube dataset (2019) contains the hourly views count of more than 2500 popular videos on Youtube, one of the biggest video hosting services

Since the view counts of all videos are collected simultaneously and seamlessly during several 10-day periods, this dataset is suitable for analyzing access patterns as well as developing algorithms to predict the popularity of online contents. The above analysis has uncovered some characteristics of the dataset, such as the daily and weekly access patterns. In the next Chapter, I will present two novel models to predict the popularity of online contents.

PREDICTING MODEL

Derivative-based Multivariate Linear Regression

Since the field of predicting the popularity of online contents was pioneered by the initiative of Szabo and Huberman [2], several later studies [3], [4] have successfully developed linear regression models to predict the popularity of online contents based on the observation that there is a strong linear relationship between the long-term popularity and the early popularity on the logarithmic scale. In this section, I also propose a linear regression model, called derivative-based multivariate linear regression (DMLR), to address short-term popularity prediction. Assuming that the popularity of a given content is a polynomial function of time, P = f(t), P can also be expressed as a Taylor series [83] as follows:

P_t = f(t) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (t - a)^n \quad (11)

where P_t is the popularity of a given content at time step t, f^{(n)}(a) denotes the n-th derivative of f evaluated at the point a, and a is a constant. By choosing a = t, the near-future popularity of the content at time step t' = t + \varepsilon (where \varepsilon is small) is obtained by the following formula:

f(t + \varepsilon) = \sum_{n=0}^{\infty} \frac{f^{(n)}(t)}{n!} \varepsilon^n \quad (12)

In equation (12), f(t) represents the popularity of the content at the previous time step, f'(t) = \partial P / \partial t expresses the current velocity, or pace of change, of the popularity of the content at time step t, f''(t) = \partial^2 P / \partial t^2 can be considered as the current acceleration of the popularity at time step t, and so on. Thus, each derivative has its own contribution, or weight, in calculating the popularity at a given time. However, in equation (12), the derivative weights are fixed at 1/n!, which makes the equation hard to fit the popularity patterns in real datasets.

Hence, by setting a different weight for each derivative, equation (12) can be rewritten as follows:

f(t + \varepsilon) = \sum_{k=0}^{n} C_k f^{(k)}(t) \quad (13)

where C = {C_0, C_1, …, C_n} is the set of derivative weights.

In conclusion, by applying equation (13), I propose a novel model called derivative-based multivariate linear regression to predict the popularity of online contents in the near future. It takes a set of n derivatives of the popularity at the current time step as input, where n is determined by experiments; a minimal sketch of the model is given below. The evaluations of this model will be described in the next Chapter.
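The following is a minimal numpy sketch of DMLR, under the assumption that the derivatives in equation (13) are approximated by repeated finite differences of the hourly view series; the weights C are then fitted by ordinary least squares.

import numpy as np

def derivative_features(series, n_deriv):
    """Stack the 0th to n-th finite differences of the series as columns."""
    feats = [series]
    d = series
    for _ in range(n_deriv):
        d = np.diff(d, prepend=d[0])   # backward difference approximation
        feats.append(d)
    return np.stack(feats, axis=1)

def fit_dmlr(series, n_deriv=6):
    """Learn the derivative weights C of equation (13)."""
    X = derivative_features(series[:-1], n_deriv)   # derivatives at time t
    y = series[1:]                                  # popularity at time t + 1
    C, *_ = np.linalg.lstsq(X, y, rcond=None)
    return C

def predict_next(series, C, n_deriv=6):
    """Predict the popularity at the next time step."""
    return derivative_features(series, n_deriv)[-1] @ C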

Attention-based Non-Recursive Neural Network

Since the popularity evolution patterns are diverse, they may differ between datasets, and even between contents within the same dataset. Moreover, the popularity of a given content can be related to the popularity of other contents in the same dataset; in fact, some Youtube videos in the same category or from the same owner often share the same popularity evolution pattern. Thus, in this section, I propose a new model which can simultaneously predict the popularity of several contents based on the relationships among their popularities.

As the popularity of each content is stored in the form of a sequence of values over time, in other words, a time series, predicting the popularity of online contents can be considered a time series prediction problem. Besides, a recent study [71] has successfully applied LSTM coupled with an attention mechanism to solve a specific time series problem. This opens a great opportunity, as well as a challenge, in utilizing state-of-the-art NLP techniques to address problems in time series prediction. To increase both accuracy and efficiency, I introduce the attention-based non-recursive neural network (ANRNN), a novel model for predicting online content popularity. Instead of predicting a single value, this model can predict n values of n input series at once. The model's structure is shown in Fig 24.

Similar to most competitive neural sequence-to-sequence models, ANRNN is built on the encoder-decoder structure [37], [40], [32]. While other sequence-to-sequence models are mainly used for translating a sequence of words from one language to another, ANRNN transforms sequences containing the historical values of the input series into sequences containing the values of these series at the next time step; in other words, it predicts the popularity of online contents given their popularity in the past. As described in Fig 24, the encoder of ANRNN is integrated with an input attention layer [71], which highlights the relevant driving series before they are fed into the core network. The encoder then maps a weighted input E = {e_1, e_2, …, e_n} to encoded values Z = {z_1, z_2, …, z_n}, where e_i and z_i are vectors containing the output of the input attention layer and the encoded values of the i-th driving series at T-1 time steps, e_i, z_i \in \mathbb{R}^{T-1}, and T is the size of the sliding window. Given Z, the decoder generates an output vector y_T = {y_1, y_2, …, y_n} which contains the predicted values of the n driving series at the last time step of the sliding window.
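A minimal sketch of the sliding-window construction described above, assuming data is a (timesteps, n) array holding the popularity of the n driving series:

import numpy as np

def make_windows(data, T):
    """Each sample maps T-1 historical steps of all n series to their values at step T."""
    X, y = [], []
    for start in range(len(data) - T + 1):
        window = data[start:start + T]
        X.append(window[:-1])   # input, shape (T-1, n)
        y.append(window[-1])    # target, shape (n,)
    return np.stack(X), np.stack(y)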

Like the Transformer, the encoder of ANRNN is a stack of N identical layers.

Each layer contains two sub-layers, called the self-attention layer and the feed-forward layer. At the self-attention layer, the attention score is obtained by the following formula:

\text{Attention}(Q, K, V) = \text{softmax}(QK^T / \sqrt{d_k}) V

Fig 24 The structure of the attention-based non-recursive neural network

where Q = EW^Q, K = EW^K and V = EW^V, with Q, K \in \mathbb{R}^{(T-1) \times d_k} and V \in \mathbb{R}^{(T-1) \times d_v} being the query, key and value matrices. Then, W^Q, W^K \in \mathbb{R}^{n \times d_k} and W^V \in \mathbb{R}^{n \times d_v} are weight matrices that the model has to learn. In this thesis, I set d_k = d_v = 64 [43].
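A minimal PyTorch sketch of this self-attention computation, with d_k = d_v = 64 as stated; the class and parameter names are illustrative rather than the thesis implementation.

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, n_series, d_k=64, d_v=64):
        super().__init__()
        # Learned projections W^Q, W^K, W^V.
        self.W_q = nn.Linear(n_series, d_k, bias=False)
        self.W_k = nn.Linear(n_series, d_k, bias=False)
        self.W_v = nn.Linear(n_series, d_v, bias=False)
        self.d_k = d_k

    def forward(self, E):                 # E: (batch, T-1, n_series)
        Q, K, V = self.W_q(E), self.W_k(E), self.W_v(E)
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5
        return torch.softmax(scores, dim=-1) @ V   # (batch, T-1, d_v)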

Since sequence-to-sequence models using recurrent neural networks have to process all time steps sequentially, they may forget the earlier parts of the input by the time they finish processing the whole sequence. Here, the self-attention mechanism is applied to this model as it allows the model to look at all other time steps in the driving sequences for clues to build a better encoding of the current time step. In contrast to a recurrent neural network, self-attention does not contain any recursive step, which makes it easy to parallelize and significantly lowers the training time. After the self-attention layer is a feed-forward layer, which is merely a fully connected ANN. This layer consists of two linear transformation layers with a ReLU activation in between. The feed-forward layer is applied to each time step identically and separately; the linear transformations are the same across time steps but use different parameters from layer to layer.

In addition, each sub-layer is followed by a layer normalization [84], which is commonly used to normalize the activities of the neurons as a method to decrease the training time. Specifically, layer normalization computes the mean and variance of all the summed inputs to the neurons in a layer on a single training case in order to normalize them. Besides, as the model stacks many layers, training becomes hard because of the vanishing gradient problem. Therefore, residual connections [85] are employed around each of the two sub-layers to allow gradients to pass through the network directly without passing through non-linear activation functions.
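A minimal sketch of how a sub-layer can be wrapped with the residual connection and layer normalization, together with the position-wise feed-forward network; the input and output dimensions are assumed to match so that the residual addition is valid.

import torch.nn as nn

class SublayerConnection(nn.Module):
    def __init__(self, size):
        super().__init__()
        self.norm = nn.LayerNorm(size)

    def forward(self, x, sublayer):
        # The residual path lets gradients bypass the non-linear sub-layer.
        return self.norm(x + sublayer(x))

def feed_forward(size, hidden):
    # Two linear transformations with a ReLU in between, applied per time step.
    return nn.Sequential(nn.Linear(size, hidden), nn.ReLU(), nn.Linear(hidden, size))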

In the same manner, the decoder of ANRNN is also a stack of N identical layers, but each layer has an additional encoder-decoder attention layer between the other two to perform attention over the source sequence representation.

In summary, ANRNN is the combination of two attention mechanisms; it can pay attention not only to the most relevant driving series but also to the critical time steps. To further evaluate the model, experiments are conducted on some real datasets, which will be discussed in the next Chapter.

EXPERIMENT

Parameter Sensitivity

6.1.1 Derivative-based Multivariate Linear Regression

As described in equation (13), the DMLR model considers n derivatives of the current popularity of a given content, P_t = f(t), to predict its popularity at the next time step. This experiment is conducted to determine the appropriate number of derivatives that the model should consider. By keeping track of the RMSE and MAE values (computed as in the sketch below) corresponding to the number of derivatives, Fig 25 and Fig 26 show the empirical results on the MovieLens and Youtube datasets respectively.
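For reference, the two error metrics used throughout the experiments are computed as follows:

import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large deviations more heavily."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: the average magnitude of the prediction errors."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))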

According to the theory of Taylor series expansion [83], the more derivatives considered, the higher the accuracy we get. This fact is also observed when the model is evaluated on both datasets. It is apparent from the two figures that both RMSE and MAE decrease when the number of derivatives increases. However, the MAE values change negligibly, while the change in RMSE values is noticeable: from 5.97 to 5.01 on the MovieLens dataset and from 10.02 to 9.13 on the Youtube dataset.

Another observation is that the MAE values obtained on the Youtube dataset are much higher than those observed on the MovieLens dataset. Furthermore, the RMSEs displayed in Fig 26 are almost twice as high as those shown in Fig 25. That means that the DMLR model can effectively predict on the MovieLens dataset but not on the Youtube dataset. This problem will be discussed further in the next section.

Fig 25 RMSE and MAE in predicting with MovieLens dataset

Fig 26 RMSE and MAE in predicting with Youtube dataset

Since further increasing the number of derivatives does not bring significant improvement in performance but requires more historical data for the derivative calculation, I choose n = 6 for the later experiments.

6.1.2 Attention-based Non-Recursive Neural Network

As ANRNN can be considered an improved version of DA-RNN, but aimed at predicting online content popularity, in this experiment I also use the NASDAQ 100 dataset which was used in the recent study [71] to evaluate the performance of DA-RNN. The dataset consists of the stock prices of 81 major corporations listed under NASDAQ 100. The data was collected minute by minute from July 26 to December 22, 2016 (105 days in total). Each day contains about 390 data points from the opening to the closing of the market.

Specifically, I investigate the effect of ANRNN's parameters on the prediction results by conducting a grid search over a wide range of the window size (ws), the number of stacked layers (N), as well as the hidden size (hs) of each attention layer, as sketched below.
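A minimal sketch of the grid search, where train_and_validate is a hypothetical helper that trains an ANRNN with the given configuration and returns its validation RMSE; the candidate values are illustrative and the actual search ranges may differ.

from itertools import product

grid = {"ws": [3, 5, 7, 9, 11], "N": [1, 3, 5, 7, 9], "hs": [32, 64, 128, 256]}

best_cfg, best_rmse = None, float("inf")
for ws, N, hs in product(grid["ws"], grid["N"], grid["hs"]):
    rmse = train_and_validate(window_size=ws, n_layers=N, hidden_size=hs)
    if rmse < best_rmse:
        best_cfg, best_rmse = (ws, N, hs), rmse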

The results show that ws = 9 and ws = 3 give the best performance on the validation sets of the NASDAQ 100 and Youtube datasets respectively. Moreover, it can be seen in Fig 27 that the model converges quickly, after about 500 epochs, when N = 1, and the convergence rate decreases as N increases. Basically, increasing the number of layers increases the complexity of the model, which would significantly inflate the computation cost. Here, the model does not benefit from the increased complexity; it does not even converge after 2000 epochs when N is greater than 9.

As shown in Fig 28, the hidden size does not appear to bring any noticeable changes when it is greater than 64. Repeating the experiment multiple times, it can be recognized that the model with a hidden size of 128 always produces an RMSE a bit lower than the one with a hidden size of 64 when training on the Youtube dataset, while the opposite behavior is observed on the NASDAQ 100 dataset.

In short, the ANRNN model only needs one encoder-decoder layer to make predictions on both datasets, and the performance becomes worse as the number of layers increases. Although the hidden size is not so sensitive, it needs to be tested multiple times in order to choose an appropriate value. In fact, as increasing the hidden size also increases the computation cost, we may consider the tradeoff between the accuracy and the complexity of the model.

Fig 27 RMSE comparison among the different numbers of stacked layers

Fig 28 RMSE vs hidden size of attention layers

Time series prediction

From the perspective of time series prediction, I compare the ANRNN model with DA-RNN on the NASDAQ 100 dataset to investigate how well these models follow the changes in each driving series. Here, Fig 29 shows the predicted results in both the train and test processes of DA-RNN, which was proved to fit the ground truth much better than the four mentioned baselines when practicing on the whole dataset [71]. Similarly, Fig 30 plots the predictions on about 750 data points of a single time series using the ANRNN model.

Since both models fit very well on the training set, the RMSE and MAE are also reported to further evaluate their performance. As seen in Table 6.1, the average RMSE on the training set produced by DA-RNN is 3.46, while the average RMSE achieved by ANRNN is only 1.53.

In fact, both models use the input attention mechanism to adaptively select the most relevant input time series; the only difference between them is the second attention mechanism used to capture the long-term dependencies across previous time steps. Specifically, DA-RNN applies an LSTM-based recurrent neural network to encode and decode the information in the input series. This module generates a sequence of hidden states h = {h_1, h_2, …, h_n}, in which h_t is a vector storing the input information of all input sequences up to time step t. Each vector is used to compute the next hidden state.

This mechanism enables the model to accumulate information from previous steps before making predictions. Meanwhile, ANRNN uses the self-attention mechanism, which does not contain any recursive step. This mechanism not only gives the model the ability to pay attention to the most critical dependencies among all time steps but also makes the model more parallelizable. By learning the attention weights better, the ANRNN model shows its superiority in performance compared to DA-RNN. This fact is further illustrated on the test set, where the RMSEs of ANRNN and DA-RNN are about 6.06 and 8.52 respectively. Thus, the ANRNN model can outperform the DA-RNN model in terms of time series prediction in the average case. However, there are some cases where the data changes suddenly and contains many fluctuations due to noise, and both models are not able to fit the ground truth. Hence, further investigation is needed to improve these models.

Fig 29 DA-RNN’s train and test prediction on NASDAQ 100 dataset

Fig 30 ANRNN’s train and test prediction on NASDAQ 100 dataset

Table 6.1. Average RMSE comparison between DA-RNN and ANRNN on the NASDAQ 100 dataset

Algorithms    Train set    Test set
DA-RNN        3.46         8.52
ANRNN         1.53         6.06

Predicting online content popularity

To demonstrate the effectiveness of the DMLR and ANRNN models in predicting content popularity, this experiment discusses the predictability of both models on the MovieLens and Youtube datasets in comparison to some baselines. Here, Fig 31, Fig 32, Fig 33 and Fig 34 plot the empirical results of the four models when experimenting with the MovieLens dataset. On the one hand, the DMLR model is the only model that makes predictions based solely on the previous access pattern, without considering other series. Nevertheless, the DMLR model seems to give the best prediction on this dataset, as it achieves the lowest errors on the test set. Considering Table 6.2, its RMSE and MAE are only approximately 0.60 and 0.40. This means that the popularity of a movie within this dataset can potentially be represented by a function P_t = f(t). On the other hand, the other models predict the next value of a series based on its relationship with other series. As shown in Table 6.3, the mean correlation coefficient among all series within this dataset is about 0.68 – 0.83, which means they do have a strong linear relationship with each other. While both the ANRNN and DA-RNN models can give reasonable predictions, FC-ANN seems too simple to follow the changes in those series. Moreover, most of the RMSEs and MAEs achieved on the test set are higher than those achieved on the train set, as the data in the test set is unknown to these models. However, the RMSEs and MAEs produced by the DMLR model on the test set are not only the lowest errors in comparison with the other models but also lower than its errors on the train set. One of the explanations for this might be that the range of values in the test set is much smaller than in the train set.

Fig 31 DA-RNN's train and test prediction on MovieLens dataset

Fig 32 FC-ANN's train and test prediction on MovieLens dataset

In summary, DA-RNN, DMLR, and ANRNN are three models that can accurately predict movie popularity in the near future within the MovieLens dataset. The DMLR model gives the best results, with RMSE and MAE of about 0.60 and 0.40, while the ANRNN model achieves the second highest accuracy (1.56 and 1.20), followed by DA-RNN (1.76 and 1.46).

Fig 33 DMLR's train and test prediction on MovieLens dataset

Fig 34 ANRNN's train and test prediction on MovieLens dataset

Table 6.2. RMSE and MAE comparisons between different methods on the MovieLens dataset

Algorithms Train set Test set

Table 6.3. Mean correlation among series in the two datasets

Since some videos in the same category or from the same owner may share the same popularity evolution pattern, the input sequences in the Youtube dataset often have linear relationships with each other. As shown in Table 6.3, the mean correlation coefficient within the Youtube dataset is about 0.40 to 0.53. Although it is nearly half the correlation coefficient of the MovieLens dataset, it is much higher than the one achieved on the NASDAQ 100 dataset, and it shows the strong relations among sequences within the dataset. Moreover, as the data is collected more frequently, once an hour, it is easy to recognize that the popularity of most Youtube contents has daily and weekly cycles, which are also mentioned in Szabo's study [2]. Specifically, depending on the location, users seem to be more active in different periods of time within a day. However, most of them are less active on weekends than on weekdays.
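The mean pairwise correlation reported in Table 6.3 can be obtained as in the sketch below, assuming df holds one popularity series per column; pandas computes the Pearson coefficient by default.

import numpy as np

corr = df.corr()                                        # pairwise correlation matrix
off_diag = corr.values[~np.eye(len(corr), dtype=bool)]  # drop the diagonal of ones
mean_corr = off_diag.mean()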

Considering Fig 35, Fig 36, Fig 37 and Fig 38, it can be observed that FC-ANN and ANRNN generally fit the ground truth on the train set. Meanwhile, the outcome of DA-RNN appears to be a delayed sequence of the ground truth, and DMLR's predictions are affected by anomalies from the 30th to the 50th data point.

Fig 35 DA-RNN's train and test prediction on Youtube dataset

Fig 36 FC-ANN's train and test prediction on Youtube dataset

For a better understanding of the performance, the RMSE and MAE values are also reported in Table 6.4. Although FC-ANN gives an impressive result on the train set with the second lowest errors, 2.31 and 1.28 for RMSE and MAE respectively, the errors achieved on the test set are much higher: 11.51 and 8.97.

Fig 37 DMLR's train and test prediction on Youtube dataset

Fig 38 ANRNN's train and test prediction on Youtube dataset

Notably, DA-RNN gives the worst result, since its RMSE and MAE are the highest among the four models on both the train and test sets. The explanation for this could be that RNN sequence-to-sequence models have not been able to attain remarkable performance in small-data regimes [89]. In contrast, the ANRNN model appears to be very effective as it achieves the lowest RMSE and MAE on both the train and test sets. Again, the errors achieved by the DMLR on the test set are lower than those on the train set, since there is not any substantial anomaly there.

Table 6.4. RMSE and MAE comparisons between different methods on the Youtube dataset

Algorithms Train set Test set

This experiment is conducted on a personal computer with the following specifications:

• Processor: Intel Core i5, 5th generation

To measure the time efficiency, all models are trained with the same hidden size, batch size, and window size. In Fig 39, the orange bars describe the inference time of the four models when practicing with a subset of the Youtube dataset, which contains 50 sequences and about 240 data points in each sequence. The inference time of DA-RNN appears to be the highest, about twice as long as that of ANRNN, 10 times higher than FC-ANN and much higher than DMLR. Meanwhile, the blue bars represent the inference time of these models on a subset of the NASDAQ 100 dataset, which contains 81 sequences and about 500 data points in each sequence. As the data regime increases, the computation times of DA-RNN and FC-ANN rise about 2.75 times, while the computation times of ANRNN and DMLR are approximately 2.24 and 1.44 times higher than in the previous experiment. Notably, this experiment is run entirely on CPU, and the ANRNN has not been parallelized yet. As self-attention is highly parallelizable [43], the inference time of ANRNN would be massively improved when parallelized and run on GPU.

Fig 39 The comparison of inference time among DA-RNN, ANRNN, FC-ANN, and DMLR in two datasets
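A minimal sketch of how the inference time can be measured on CPU, assuming a trained PyTorch model and a batch of test inputs; averaging over several runs smooths out timing jitter.

import time
import torch

def inference_time(model, test_inputs, repeats=10):
    model.eval()
    with torch.no_grad():                 # disable autograd for pure inference
        start = time.perf_counter()
        for _ in range(repeats):
            model(test_inputs)
    return (time.perf_counter() - start) / repeats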

In summary, the above empirical results have shown that ANRNN and DMLR can outperform the two baseline methods when practicing with the two real datasets. As ANRNN relies entirely on the input attention and self-attention mechanisms to make predictions, without using an RNN or CNN, it not only improves accuracy but also reduces computation time. Moreover, DMLR, with a very low computation cost, can also give reasonable results in practice.

CONCLUSION & FUTURE WORK

Recently, computer systems have been generating an enormous amount of data, including log files. Since log files can be considered an informative and endless resource, mining log files has become an extensive research area. As a result, many algorithms and techniques have been extensively researched to address specific problems in handling, processing, and analyzing log files. Aiming to analyze weblogs, this thesis is expected to extract valuable information and knowledge from weblogs as well as predict the popularity of online contents based on their historical accesses.

Commonly, weblogs have various formats and may be generated from many sources, which makes processing weblogs a challenging task. Since each dataset has its own characteristics, pre-processing and analyzing those characteristics are of great importance for proposing appropriate models to predict the popularity of the contents. Within the scope of this thesis, data processing as well as several analyses have been performed on the HCMUT weblogs, the MovieLens dataset, and the Youtube datasets. Specifically, the thesis not only carefully describes the processing pipeline but also provides several investigations which reveal the users' preferences over time as well as the distribution of the HCMUT website's visitors based on their geographical locations. For the MovieLens dataset, some other statistics are also provided for a better understanding of the popularity evolution patterns of the movies within the dataset. For the Youtube datasets, as the data were collected more frequently, several measurements have shown some characteristics of the user access patterns in different countries, such as the periodicity of contents' popularity.

As a part of the thesis, two machine learning models are proposed to address the problem of predicting online content popularity. The first model, called Derivative-based Multivariate Linear Regression, is built based on Taylor's expansion and linear regression, while the second model, Attention-based Non-Recursive Neural Network, is the combination of two state-of-the-art attention mechanisms which are widely applied in the field of natural language processing. The experimental results have shown that the two new models outperform several baseline methods overall, and the inference time is also significantly improved.

As data always contains noise, there are some cases where the proposed models cannot give reasonable performance. Moreover, since there are inadequate historical records for some newly uploaded videos which attract the attention of a large number of users within a short period of time after being uploaded, the proposed models may not be able to make reliable predictions for them. However, there are grounds for optimism as computational capabilities increase, and the rapid development of machine learning and deep learning has resulted in many superior mechanisms and techniques. Therefore, improving the existing models to overcome those limitations is considerable future work. As each dataset has its own characteristics, it is hard to propose a general model which can effectively make predictions on all datasets. This work will pioneer in building a framework that contains several superior models to predict the popularity of online contents across different datasets based on analyzing their characteristics.

In summary, since the amount of data, especially video, that is brought onto the Internet is growing exponentially, knowing the popularity of contents in the future would be beneficial to a handful of applications. Therefore, the results of this thesis can be widely applied to solving some real-world problems such as improving content distribution, cache placement policies, and so on. These applications leave room for further improvement and integration of this study.

[1] C V N I Cisco, "The zettabyte era—trends and analysis, 2015–2020 white paper," ed: July, 2016

[2] G Szabo and B A Huberman, "Predicting the popularity of online content," Communications of the ACM, vol 53, no 8, pp 80-88, 2010

[3] H Pinto, J M Almeida, and M A Gonçalves, "Using early view patterns to predict the popularity of youtube videos," in Proceedings of the sixth ACM international conference on Web search and data mining, 2013: ACM, pp

[4] C Li, J Liu, and S Ouyang, "Characterizing and predicting the popularity of online videos," IEEE Access, vol 4, pp 1630-1641, 2016

[5] Y Wu, J M Hernández-Lobato, and Z Ghahramani, "Dynamic covariance models for multivariate financial time series," arXiv preprint arXiv:1305.4268, 2013

[6] P Chakraborty, M Marwah, M Arlitt, and N Ramakrishnan, "Fine-grained photovoltaic output prediction using a bayesian ensemble," in Twenty-Sixth

AAAI Conference on Artificial Intelligence, 2012

[7] V Vukotić, S.-L Pintea, C Raymond, G Gravier, and J C van Gemert,

"One-step time-dependent future video frame prediction with a convolutional encoder-decoder neural network," in International Conference on Image

Analysis and Processing, 2017: Springer, pp 140-151

[8] Z Liu and M Hauskrecht, "A regularized linear dynamical system framework for multivariate time series analysis," in Twenty-Ninth AAAI Conference on

[9] C C Holt, "Forecasting seasonals and trends by exponentially weighted moving averages," International journal of forecasting, vol 20, no 1, pp 5-10, 2004

[10] S Hansun, "A new approach of moving average method in time series analysis," in 2013 Conference on New Media Studies (CoNMedia), 2013:

[11] L Harrison, W D Penny, and K Friston, "Multivariate autoregressive modeling of fMRI time series," Neuroimage, vol 19, no 4, pp 1477-1491, 2003

[12] E Masry, "Multivariate local polynomial regression for time series: uniform strong consistency and rates," Journal of Time Series Analysis, vol 17, no 6, pp 571-599, 1996

[13] Y.-S Lee and L.-I Tong, "Forecasting time series using a methodology based on autoregressive integrated moving average and genetic programming,"

Knowledge-Based Systems, vol 24, no 1, pp 66-72, 2011

[14] S Ling and W Li, "On fractionally integrated autoregressive moving-average time series models with conditional heteroscedasticity," Journal of the

American Statistical Association, vol 92, no 439, pp 1184-1194, 1997

[15] HCMUT "Ho Chi Minh city University of Technology : https://hcmut.edu.vn/." (accessed March 18, 2019)

[16] Youtube "Youtube Application Programming Interface: https://developers.google.com/youtube/." (accessed March 18, 2019)

[17] F M Harper and J A Konstan, "The movielens datasets: History and context,"

Acm transactions on interactive intelligent systems (tiis), vol 5, no 4, p 19,

[18] Y Zhou, L Chen, C Yang, and D M Chiu, "Video popularity dynamics and its implication for replication," IEEE transactions on multimedia, vol 17, no

[19] J Cheng, L Dong, and M Lapata, "Long short-term memory-networks for machine reading," arXiv preprint arXiv:1601.06733, 2016

[20] A P Parikh, O Täckström, D Das, and J Uszkoreit, "A decomposable attention model for natural language inference," arXiv preprint arXiv:1606.01933, 2016

[21] R Paulus, C Xiong, and R Socher, "A deep reinforced model for abstractive summarization," arXiv preprint arXiv:1705.04304, 2017

[22] Z Lin et al., "A structured self-attentive sentence embedding," arXiv preprint arXiv:1703.03130, 2017

[23] B Liu, Web data mining: exploring hyperlinks, contents, and usage data

[24] R Cooley, B Mobasher, and J Srivastava, "Data preparation for mining world wide web browsing patterns," Knowledge and information systems, vol 1, no

[25] F E Tay and L Cao, "Application of support vector machines in financial time series forecasting," omega, vol 29, no 4, pp 309-317, 2001

[26] C.-J Lu, T.-S Lee, and C.-C Chiu, "Financial time series forecasting using independent component analysis and support vector regression," Decision

Support Systems, vol 47, no 2, pp 115-125, 2009

[27] M Hushchyn, P Charpentier, and A Ustyuzhanin, "Disk storage management for LHCb based on Data Popularity estimator," in Journal of Physics:

Conference Series, 2015, vol 664, no 4: IOP Publishing, p 042026

[28] P J Brockwell, R A Davis, and M V Calder, Introduction to time series and forecasting Springer, 2002

[29] Y Gao and M J Er, "NARMAX time series model prediction: feedforward and recurrent fuzzy neural network approaches," Fuzzy sets and systems, vol

[30] G Hinton et al., "Deep neural networks for acoustic modeling in speech recognition," IEEE Signal processing magazine, vol 29, 2012

[31] G E Dahl, D Yu, L Deng, and A Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE

Transactions on audio, speech, and language processing, vol 20, no 1, pp 30-

[32] I Sutskever, O Vinyals, and Q V Le, "Sequence to sequence learning with neural networks," in Advances in neural information processing systems, 2014, pp 3104-3112

[33] J K Chorowski, D Bahdanau, D Serdyuk, K Cho, and Y Bengio,

"Attention-based models for speech recognition," in Advances in neural information processing systems, 2015, pp 577-585

[34] A M Rush, S Chopra, and J Weston, "A neural attention model for abstractive sentence summarization," arXiv preprint arXiv:1509.00685, 2015

[35] R Nallapati, B Zhou, C Gulcehre, and B Xiang, "Abstractive text summarization using sequence-to-sequence rnns and beyond," arXiv preprint arXiv:1602.06023, 2016

[36] S Shen, Y Zhao, Z Liu, and M Sun, "Neural headline generation with sentence-wise optimization," arXiv preprint arXiv:1604.01904, 2016

[37] D Bahdanau, K Cho, and Y Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014

[38] M.-T Luong, H Pham, and C D Manning, "Effective approaches to attention- based neural machine translation," arXiv preprint arXiv:1508.04025, 2015

[39] S Hochreiter and J Schmidhuber, "Long short-term memory," Neural computation, vol 9, no 8, pp 1735-1780, 1997

[40] K Cho et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014

[41] J Gehring, M Auli, D Grangier, D Yarats, and Y N Dauphin,

"Convolutional sequence to sequence learning," in Proceedings of the 34th

International Conference on Machine Learning-Volume 70, 2017: JMLR org, pp 1243-1252

[42] A Graves, G Wayne, and I Danihelka, "Neural turing machines," arXiv preprint arXiv:1410.5401, 2014

[43] A Vaswani et al., "Attention is all you need," in Advances in Neural

[44] D Zeng, K Liu, S Lai, G Zhou, and J Zhao, "Relation classification via convolutional deep neural network," 2014

[45] A Conneau, H Schwenk, L Barrault, and Y Lecun, "Very deep convolutional networks for text classification," arXiv preprint arXiv:1606.01781, 2016

[46] Ş Gündüz and M T Özsu, "A web page prediction model based on click- stream tree representation of user behavior," in Proceedings of the ninth ACM

SIGKDD international conference on Knowledge discovery and data mining,

[47] M Spiliopoulou, L C Faulstich, and K Winkler, "A data miner analyzing the navigational behaviour of web users," in Proc of the Workshop on Machine

Learning in User Modelling of the ACAI99, 1999: Greece, July

[48] M Spiliopoulou and L C Faulstich, "Wum: A web utilization miner," in

International Workshop on the Web and Databases, Valencia, Spain, 1998:

[49] C Alfaro, J Cano-Montero, J Gómez, J M Moguerza, and F Ortega, "A multi-stage method for content classification and opinion mining on weblog comments," Annals of Operations Research, vol 236, no 1, pp 197-213, 2016

[50] R Mishra and A Choubey, "Discovery of frequent patterns from web log data by using FP-growth algorithm for web usage mining," International Journal of

Advanced Research in Computer Science and Software Engineering, vol 2, no

[51] Z Qiankun, "Association Rule Mining: A Survey, Technical Report," CAIS,

[52] R Iváncsy and I Vajk, "Frequent pattern mining in web log data," Acta

Polytechnica Hungarica, vol 3, no 1, pp 77-90, 2006

[53] M Perkowitz and O Etzioni, "Adaptive web sites: Automatically synthesizing web pages," in AAAI/IAAI, 1998, pp 727-732

[54] W Wang, J Yang, and S Y Philip, Efficient mining of weighted association rules (WAR) IBM Thomas J Watson Research Division, 2000

[55] F Tao, F Murtagh, and M Farid, "Weighted association rule mining using weighted support and significance framework," in Proceedings of the ninth

ACM SIGKDD international conference on Knowledge discovery and data mining, 2003: ACM, pp 661-666

[56] L Sun and X Zhang, "Efficient frequent pattern mining on web logs," in Asia-

Pacific Web Conference, 2004: Springer, pp 533-542

[57] R Agrawal and R Srikant, "Fast algorithms for mining association rules," in

Proc 20th int conf very large data bases, VLDB, 1994, vol 1215, pp 487-

[58] J Han, J Pei, and Y Yin, "Mining frequent patterns without candidate generation," in ACM sigmod record, 2000, vol 29, no 2: ACM, pp 1-12

[59] A Singh and K K Das, "Application of data mining techniques in bioinformatics," 2007

[60] F Bonchi et al., "Web log data warehousing and mining for intelligent web caching," Data & Knowledge Engineering, vol 36, pp 165-189, 2001

[61] X Zhang, J Edwards, and J Harding, "Personalised online sales using web usage data mining," Computers in Industry, vol 58, no 8-9, pp 772-782, 2007

[62] R Das and I Turkoglu, "Creating meaningful data from web logs for improving the impressiveness of a website by using path analysis method,"

Expert Systems with Applications, vol 36, no 3, pp 6635-6644, 2009

[63] G Neelima and S Rodda, "Predicting user behavior through sessions using the web log mining," in 2016 International Conference on Advances in Human

Machine Interaction (HMI), 2016: IEEE, pp 1-5

[64] G Wang, X Zhang, S Tang, H Zheng, and B Y Zhao, "Unsupervised clickstream clustering for user behavior analysis," in Proceedings of the 2016

CHI Conference on Human Factors in Computing Systems, 2016: ACM, pp

[65] D Asteriou and S G Hall, "ARIMA models and the Box–Jenkins methodology," Applied Econometrics, vol 2, no 2, pp 265-286, 2011

[66] T Lin, B G Horne, P Tino, and C L Giles, "Learning long-term dependencies in NARX recurrent neural networks," IEEE Transactions on

Neural Networks, vol 7, no 6, pp 1329-1338, 1996

[67] E Diaconescu, "The use of NARX neural networks to predict chaotic time series," Wseas Transactions on computer research, vol 3, no 3, pp 182-191, 2008

[68] S Chen, X Wang, and C J Harris, "NARX-based nonlinear system identification using orthogonal least squares basis hunting," IEEE Transactions on Control Systems Technology, vol 16, no 1, pp 78-84, 2007

[69] A Bouchachia and S Bouchachia, Ensemble learning for time series prediction na, 2008

[70] R Frigola and C E Rasmussen, "Integrated pre-processing for Bayesian nonlinear system identification with Gaussian processes," in 52nd IEEE

Conference on Decision and Control, 2013: IEEE, pp 5371-5376

[71] Y Qin, D Song, H Chen, W Cheng, G Jiang, and G Cottrell, "A dual-stage attention-based recurrent neural network for time series prediction," arXiv preprint arXiv:1704.02971, 2017

[72] Y Bengio, P Simard, and P Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE transactions on neural networks, vol 5, no

[73] T Wu, M Timmers, D De Vleeschauwer, and W Van Leekwijck, "On the use of reservoir computing in popularity prediction," in 2010 2nd International

Conference on Evolving Internet, 2010: IEEE, pp 19-24

[74] G Gürsun, M Crovella, and I Matta, "Describing and forecasting video access patterns," in 2011 Proceedings IEEE INFOCOM, 2011: IEEE, pp 16- 20

[75] M Meoni, "Mining Predictive Models for Big Data Placement," U Pisa

[76] X Cheng, C Dale, and J Liu, "Statistics and social network of youtube videos," in 2008 16th Interntional Workshop on Quality of Service, 2008:

[77] Z Pabarskaite, "Implementing advanced cleaning and end-user interpretability technologies in web log mining," in ITI 2002 Proceedings of the 24th

International Conference on Information Technology Interfaces (IEEE Cat No

[78] M.-T Nguyen, T.-D Diep, T H Vinh, T Nakajima, and N Thoai,

"Analyzing and Visualizing Web Server Access Log File," in International

Conference on Future Data and Security Engineering, 2018: Springer, pp 349-

[79] X Cheng, C Dale, and J Liu, "Understanding the characteristics of internet short video sharing: YouTube as a case study," arXiv preprint arXiv:0707.3670, 2007

[80] E W Stacy, "A generalization of the gamma distribution," The Annals of mathematical statistics, vol 33, no 3, pp 1187-1192, 1962

[81] Alexa "Youtube Statistic: https://www.alexa.com/siteinfo/youtube.com." (accessed March 18, 2019)

[82] T Nakajima, M Yoshimi, C Wu, and T Yoshinaga, "Color-based cooperative cache and its routing scheme for Telco-CDNs," IEICE TRANSACTIONS on

Information and Systems, vol 100, no 12, pp 2847-2856, 2017

[83] P Dienes, The Taylor series: an introduction to the theory of functions of a complex variable Dover New York, 1957

[84] J Lei Ba, J R Kiros, and G E Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016

[85] K He, X Zhang, S Ren, and J Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp 770-778

[86] J Benesty, J Chen, Y Huang, and I Cohen, "Pearson correlation coefficient," in Noise reduction in speech processing: Springer, 2009, pp 1-4

[87] L Myers and M J Sirois, "Spearman correlation coefficients, differences between," Encyclopedia of statistical sciences, vol 12, 2004

[88] H Abdi, "The Kendall rank correlation coefficient," Encyclopedia of

Measurement and Statistics Sage, Thousand Oaks, CA, pp 508-510, 2007

[89] O Vinyals, Ł Kaiser, T Koo, S Petrov, I Sutskever, and G Hinton,

"Grammar as a foreign language," in Advances in neural information processing systems, 2015, pp 2773-2781

This Chapter provides a list of related papers published at or submitted to several international conferences in the last two years.

[6] Minh-Tri Nguyen, Duong H. Le, Nakajima Takuma, Masato Yoshimi and Nam Thoai, "Attention-based Neural Network: A Novel Approach for Predicting the Popularity of Online Content" in The IEEE 21st International Conference on High Performance Computing and Communications (HPCC), Zhangjiajie, China, 2019

[5] Anh-Tu Ngoc Tran, Minh-Tri Nguyen, Thanh-Dang Diep, Takuma Nakajima, and Nam Thoai, "Optimizing Color-Based Cooperative Caching in Telco-CDNs by Using Real Datasets" in The 13th International Conference on Ubiquitous Information Management and Communication (IMCOM), Phuket, Thailand, 2019

[4] Anh-Tu Ngoc Tran, Minh-Tri Nguyen, Thanh-Dang Diep, Takuma Nakajima, and Nam Thoai, "A Performance Study of Color-Based Caching in Telco-CDNs by Using Real Datasets" in The 9th International Symposium on Information and Communication Technology (SoICT), Da Nang, Vietnam, 2018

[3] Minh-Tri Nguyen, Thanh-Dang Diep, Tran Hoang Vinh, Takuma Nakajima, and Nam Thoai, “Analyzing and Visualizing Web Server Access Log File” in The 5th International Conference on Future Data and Security Engineering (FDSE), Ho Chi Minh City, Vietnam, 2018

[2] Anh-Tu Ngoc Tran, Huu-Phu Nguyen, Minh-Tri Nguyen, Thanh-Dang Diep, Nguyen Quang-Hung and Nam Thoai, "pyMIC-DL: A Library for Deep Learning Frameworks Run on the Intel Xeon Phi Coprocessor" in The IEEE 20th International Conference on High Performance Computing and Communications (HPCC), Exeter, United Kingdom, 2018

[1] Thanh-Dang Diep, Minh-Tri Nguyen, Nhu-Y Nguyen-Huynh, Minh Thanh Chung, Manh-Thin Nguyen, Nguyen Quang-Hung, and Nam Thoai, "Chainer-XP: A Flexible Framework for Artificial Neural Networks Run on the Intel Xeon Phi Coprocessor" in The 7th International Conference on High Performance Scientific Computing Simulation, Modeling and Optimization of Complex Processes (HPSC), Hanoi, Vietnam, 2018 (Currently under review)
