
LIFELONG MACHINE LEARNING METHODS AND ITS APPLICATION IN MULTI-LABEL CLASSIFICATION


DOCUMENT INFORMATION

Basic information

Title: Lifelong Machine Learning Methods and Its Application in Multi-Label Classification
Author: Nguyen Minh Chau
Supervisor: Assoc. Prof. Ha Quang Thuy
University: Vietnam National University, Hanoi – University of Engineering and Technology
Major: Computer Science
Document type: Thesis
Year: 2019
City: Hanoi
Pages: 52
File size: 1.61 MB

Structure

  • Chapter 1. INTRODUCTION
    • 1.1. Motivation
    • 1.2. Contributions and thesis format
      • 1.2.1. Contributions
      • 1.2.2. Thesis format
  • Chapter 2. RELATED WORK
    • 2.1. Lifelong machine learning
      • 2.1.1. Definition of lifelong learning
      • 2.1.2. System architecture of lifelong learning
      • 2.1.3. Lifelong topic modeling
      • 2.1.4. LTM: a lifelong topic model
      • 2.1.5. AMC: a lifelong topic model for small data
    • 2.2. Classifiers
      • 2.2.1. K-nearest neighbors
      • 2.2.2. Naïve Bayes
      • 2.2.3. Decision trees
      • 2.2.4. Gaussian Processes
      • 2.2.5. Random forest
      • 2.2.6. Multilayer Perceptrons (MLP)
      • 2.2.7. AdaBoost
  • Chapter 3. THE METHOD
    • 3.1. Problem formulation
    • 3.2. The closeness of previous datasets to the current dataset
    • 3.3. The closeness of two datasets
    • 3.4. Proposed model of lifelong topic modeling using close domain knowledge for multi-label classification
  • Chapter 4. RESULTS AND DISCUSSIONS
    • 4.1. The datasets
    • 4.2. Experimental scenarios
    • 4.3. Experimental results and discussions

Content

VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Nguyen Minh Chau

LIFELONG MACHINE LEARNING METHODS AND ITS APPLICATION IN MULTI-LABEL CLASSIFICATION

Major: Computer Science
Supervisor: Assoc. Prof. Ha Quang Thuy

HANOI – 2019

AUTHORSHIP

“I hereby declare that the work contained in this thesis is my own and has not been previously submitted for a degree or diploma at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no materials previously published or written by another person except where due reference or acknowledgement is made.”

Signature: ………………………………………………

SUPERVISOR'S APPROVAL

“I hereby approve that the thesis in its current form is ready for committee examination as a requirement for the Bachelor of Computer Science degree at the University of Engineering and Technology.”

Signature: ………………………………………………

ACKNOWLEDGEMENT

First of all, I would like to express my sincere and deepest gratitude to my teacher, Assoc. Prof. Ha Quang Thuy, who dedicatedly instructed, encouraged and guided me during the research process. Secondly, I would like to thank the teachers and students in the Knowledge Technology Laboratory, especially Dr. Pham Thi Ngan and Mr. Nguyen Van Quang, for their enthusiasm in working with, commenting on and guiding me while doing research as members of the research team. Thirdly, I sincerely thank the teachers and staff of the University of Engineering and Technology, Vietnam National University, Hanoi for creating favorable conditions for me to do research. Finally, I want to thank my family and friends, especially my parents, who always give me love, faith and encouragement.

ABSTRACT

Multi-label classification is a classification problem in which a data instance can have more than one label. Multi-label learning is very useful in text classification applications; however, there are many challenges in building training examples. One challenge is that we may not have a large amount of data for training when we face a new task. In addition, even if we spend time collecting a large amount of data, labelling multi-label data is a very time-consuming task. Hence, we should have a model that can work well when we only have a small amount of training data. Lifelong Machine Learning (LML) is one possible approach to solve this problem. Lifelong Machine Learning (or Lifelong Learning) is an advanced machine learning paradigm that learns continuously, accumulates the knowledge learned in past tasks, and uses it to support future learning. In the process, the learner becomes more and more knowledgeable and more effective at learning. This learning capability is one of the hallmarks of human intelligence. In contrast, the current dominant machine learning paradigm learns in isolation: given a training dataset, it runs a machine learning algorithm on the dataset to create a model. It makes no attempt to retain the learned knowledge and use it in future learning.
Although this isolated learning paradigm has been very successful, it requires a large number of training examples and is only suitable for well-defined and narrow tasks. In comparison, we humans can learn effectively from a few examples because we have accumulated a great amount of knowledge in the past, which enables us to learn with little data or effort. Lifelong learning aims to achieve this capability. As statistical machine learning matures, the time has come to try to break the isolated learning tradition and to study lifelong learning in order to bring machine learning to a higher level. Applications such as intelligent assistants, chatbots, and physical robots that interact with humans and systems in real-life environments are also calling for such lifelong learning capabilities. Without the capacity to accumulate knowledge and use it to learn more gradually, a system will probably never be truly intelligent.

List of Figures

Figure 1. The system architecture of lifelong machine learning
Figure 2. The Lifelong Topic Model (LTM) system architecture
Figure 3. The AMC model system architecture
Figure 4. Entropy function with n = 2
Figure 5. The lifelong topic model using close domain knowledge for multi-label classification

List of Tables

Table 1. Data division details
Table 2. The experimental results with 50 reviews in D4, using kNN and Decision Tree as classifying methods
Table 3. The experimental results with 50 reviews in D4, using Random Forest, MLP, AdaBoost and Gaussian Naïve Bayes as classifying methods
Table 4. The experimental results with 100 reviews in D4, using kNN and Decision Tree as classifying methods
Table 5. The experimental results with 100 reviews in D4, using Random Forest, MLP, AdaBoost and Gaussian Naïve Bayes as classifying methods
TÓM TẮT

Multi-label classification is a class of classification problems in which a data instance can have more than one label. Multi-label classifiers are very useful in text classification applications; however, there are many challenges in building a set of training examples. One challenge is that we may not have a large amount of data for training when we face a new task. In addition, even if we spend time collecting a large amount of data, labelling multi-label data is a very time-consuming job. Therefore, we should have a model that can work well when we only have a small amount of training data. Lifelong Machine Learning (LML) is a feasible approach to solve this problem. Lifelong machine learning (or lifelong learning) is an advanced machine learning paradigm that learns continuously, gathers the knowledge learned in previous tasks, and uses it to support learning in the future. In the process, the learner becomes more and more knowledgeable and more effective at learning. This learning capability is one of the hallmarks of human intelligence. In contrast, the currently dominant machine learning paradigm learns in isolation: it is given a training dataset and runs a machine learning algorithm on that dataset to produce a model. It makes no effort to retain the knowledge and use it in future learning. Although this isolated learning paradigm is effective, it requires a large number of training examples and is only suitable for well-defined problems. Meanwhile, we humans can learn from just a few examples because we have accumulated a large amount of knowledge in the past, which allows us to learn with little data or effort. Lifelong machine learning can realize this capability. It is time to break the isolated machine learning tradition and move toward lifelong machine learning in order to bring machine learning to a higher level. Applications such as intelligent assistants, chatbots and physical robots that interact with humans and other systems in real environments also need lifelong learning capability. Without the ability to accumulate knowledge and use it to serve gradual learning, a system will probably never be considered truly intelligent.

ABBREVIATIONS

LML    Lifelong Machine Learning
ML     Machine Learning
AI     Artificial Intelligence
kNN    k-nearest neighbors
NBC    Naive Bayes Classifier
CART   Classification and Regression Trees
MLP    Multilayer Perceptrons

Chapter 1. INTRODUCTION

1.1. Motivation

Multi-label text classification is a problem with many practical applications. For example, you own a hotel and you care about what your customers say about your hotel service on the hotel website or on a social network (e.g. Facebook). Those aspects could be the attitude of the staff, the view from the hotel, the price of the rooms, the quality of the hotel food, and so on. There are thousands or even tens of thousands of reviews on your website or on social networks. However, it is very time-consuming to read and classify each review. A good approach is to have these reviews classified automatically; however, assigning labels to thousands of reviews is also a waste of time and effort. If there is a model that can classify well with just a small amount of labeled data, you will save your time and effort.

The commonly used approach is machine learning. Chen and Liu [x] state that “Machine learning (ML) has been instrumental for the advances of both data analysis and artificial intelligence (AI)”. The recent success of deep learning brings it to new heights.
ML algorithms have been used in practically all areas of computer science and many areas of natural science, engineering, and the social sciences. Practical applications are even more widespread. Without effective ML algorithms, many industries would not have flourished, e.g., Internet commerce and Web search. The current dominant paradigm for machine learning is to run an ML algorithm on a dataset to create a model; the model is then applied in real-life tasks. This is true for both supervised learning and unsupervised learning. It is called isolated learning because it does not consider any other related data or previously learned knowledge. The crucial issue with this isolated learning paradigm is that it does not retain and accumulate knowledge learned before and use it in future learning. This is not the way humans learn.

We humans never learn in isolation. We always retain the knowledge learned in the past and use it to support future learning and problem solving. That is why whenever we encounter a new situation or problem, we may notice that many aspects of it are not really new, because we have seen them in the past in some other contexts. Without the ability to accumulate knowledge, an ML algorithm ordinarily needs a large number of training examples to learn effectively. For supervised learning, labeling of data is usually done manually, which is very time-consuming and tedious. Since the world is too complex with too many possible tasks, it is almost impossible to label a large number of examples for every possible task or application for an ML algorithm to learn. To make matters worse, everything around us also changes constantly, so the labeling would have to be done continually, which is an overwhelming task for people. Even for unsupervised learning, gathering a substantial volume of data may not be possible in many cases.

In contrast, we human beings seem to learn quite differently. We accumulate and maintain the knowledge learned from previous tasks and use it seamlessly in learning new tasks and solving new problems. Over time we learn more and more and become more and more knowledgeable, and more and more effective at learning. Lifelong Machine Learning (LML) (or simply lifelong learning) aims to mimic this human learning process and capability. This type of learning is quite natural because things around us are closely related and interconnected. Knowledge learned about some subjects can help us understand and learn about other subjects. For example, we humans do not need 1,000 positive online reviews and 1,000 negative online reviews of movies, as an ML algorithm would, in order to build an accurate classifier for positive and negative reviews about a movie. In fact, for this task, without a single training example, we can already perform the classification. How can that be? The reason is simple: we have accumulated so much knowledge in the past about the language expressions that people use to praise and criticize things, although none of those praises or criticisms may have been in the form of online reviews. Interestingly, if we did not have such past knowledge, we humans would probably be unable to manually build a good classifier even with 1,000 positive and 1,000 negative training reviews without spending an enormous amount of time. For example, if you have no knowledge of Arabic or Arabic sentiment expressions and someone gives you 2,000 labeled training reviews in Arabic and asks you to build a classifier manually, most probably you will not be able to do it without using a translator.

To make the case more general, we use natural language processing (NLP) as an example. It is easy to see the importance of LML to NLP for several reasons. First, words and phrases have almost the same meaning in all domains and all tasks. Second, sentences in every domain follow the same syntax or grammar. Third, almost all natural language processing problems are closely related to each other, which means that they are interconnected and affect each other in some ways. The first two reasons ensure that the knowledge learned can be used across domains and tasks due to the sharing of the same expressions and meanings and the same syntax. That is why we humans do not need to re-learn the language (or learn a new language) whenever we encounter a new application domain. For example, assume we have never studied psychology and we want to study it now. We do not need to learn the language used in psychology texts, except some new concepts in the psychology domain, because everything about the language itself is the same as in any other domain or area. The third reason ensures that LML can be used across different types of tasks. Traditionally, these problems are solved separately in isolation, but they are all related and can help each other because the results from one problem can be useful to others. This situation is common for all NLP tasks.

Note that we regard anything that goes from unknown to known as a piece of knowledge. Thus, a learned model is a piece of knowledge, and the results gained from applying the model are also knowledge, although they are different kinds of knowledge. A large quantity of knowledge is often needed in order to effectively help the new task learning, because the knowledge gained from one previous task may contain only a tiny bit or even no knowledge that is applicable to the new task (unless the two tasks are extremely similar). Thus, it is important to learn from a large number of diverse domains to accumulate a large amount of diverse knowledge over time. A future task can then pick and choose the appropriate knowledge to use to help its learning. As the world also changes constantly, the learning should thus be continuous and lifelong, which is what we humans do.

The classic isolated learning paradigm is unable to perform such lifelong learning. Isolated learning is only suitable for narrow and restricted tasks; it is probably not sufficient for building an intelligent system that can learn continuously to achieve a level close to human intelligence. LML aims to make progress in this direction. With the popularity of interactive robots, intelligent personal assistants, and chatbots, LML is becoming increasingly important because these systems have to interact with humans and/or other systems, learn constantly in the process, and retain and accumulate the knowledge learned in these interactions in ever-changing environments, to enable them to learn more and learn better over time and to function seamlessly.
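As a concrete illustration of the multi-label setting motivating this thesis, the following sketch (a minimal example, assuming scikit-learn is available; the reviews and aspect labels are hypothetical) trains a simple one-vs-rest classifier that can assign several aspect labels to one hotel review at once.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical labeled hotel reviews: each review may carry several aspect labels.
reviews = [
    "The staff were friendly and the room price was reasonable",
    "Great view from the balcony but the food was disappointing",
    "Breakfast was delicious and the staff helped with everything",
    "The room was overpriced for such a poor view",
]
labels = [{"staff", "price"}, {"view", "food"}, {"food", "staff"}, {"price", "view"}]

# Turn the label sets into a 0/1 indicator matrix (one column per aspect).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# One independent binary classifier per aspect over TF-IDF features.
vec = TfidfVectorizer()
X = vec.fit_transform(reviews)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

# A new review can receive zero, one, or several labels at once.
pred = clf.predict(vec.transform(["friendly staff and tasty food"]))
print(mlb.inverse_transform(pred))   # a list with one (possibly empty) tuple of aspects
```

With only four training reviews the predictions are of course unreliable; the point is only to show the data shape of a multi-label task, where the target is a set of labels per instance rather than a single class.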
1.2. Contributions and thesis format

1.2.1. Contributions

The thesis has three main contributions: (i) proposing a lifelong topic modeling method that uses prior domain knowledge from close domains, (ii) proposing three close-domain measures based on a similarity measure, probability features, and the features of classifiers running on the datasets, and (iii) applying the proposed approaches to several multi-label classification tasks.

1.2.2. Thesis format

The rest of this thesis is organized as follows:

• Chapter 2 presents related work that will be used by the proposed method.
• Chapter 3 presents the proposed method.
• Chapter 4 provides the details of the experiments and results on the text multi-label classification problem.

Chapter 2. RELATED WORK

This chapter presents lifelong machine learning – the learning paradigm that will be used in the proposed method.

2.1. Lifelong machine learning

2.1.1. Definition of lifelong learning

The earlier definition of LML is as follows: the system has performed N tasks; when faced with the (N + 1)th task, it uses the knowledge gained from the N tasks to help the (N + 1)th task. Here we extend this definition by giving it more details, mainly by adding an explicit knowledge base (KB) to stress the importance of knowledge accumulation and the meta-mining of additional higher-level knowledge from the knowledge retained from previous learning.

Lifelong machine learning is a continuous learning process. At any point in time, the learner has performed a sequence of N learning tasks T1, T2, …, TN. We call these tasks previous tasks. They have their corresponding datasets D1, D2, …, DN. When facing the (N + 1)th task TN+1 (called the new task or the current task) with its dataset DN+1, the learner can leverage the past knowledge in the knowledge base to help learn the current task TN+1. After the completion of learning TN+1, the knowledge gained from learning TN+1 is added to the knowledge base. The learner will continue to learn whenever it faces a new task. The objective of LML is usually to optimize the performance on the new task TN+1, but it can optimize on any task by treating the rest of the tasks as the previous tasks. The KB maintains the knowledge learned and accumulated from the previous tasks. After the completion of learning TN+1, the KB is updated with the knowledge (e.g., intermediate as well as final results) gained from learning TN+1. The updating can involve consistency checking, reasoning, and meta-mining of additional higher-level knowledge.

Since this definition is quite general, some remarks are in order:

1. The definition shows three key characteristics of LML: (1) continuous learning, (2) knowledge accumulation and maintenance in the knowledge base (KB), and (3) the ability to use the past knowledge to help future learning. That is, the lifelong learner learns a series of tasks, possibly never ending, and in the process it becomes more and more knowledgeable and better at learning. These characteristics make LML different from related learning paradigms such as transfer learning and multi-task learning, which lack one or more of these characteristics.

2. The tasks do not have to be from the same domain. There is no unified definition of a domain in the literature that is applicable to all areas.
In most cases, the term is used informally to mean a setting with a fixed feature space where there can be multiple different tasks of the same type or of different types (e.g., information extraction, coreference resolution, and entity linking). Some researchers even use domain and task interchangeably because there is only one task from each domain in their study. We also use them interchangeably in many cases for the same reason, but will distinguish them when needed.

3. The shift to the new task can happen abruptly or gradually, and the tasks and their data do not have to be provided by some external systems or human users. Ideally, a lifelong learner should find its own learning tasks and training data in its interaction with the environment by performing self-motivated learning. For example, a service robot in a hotel may be trained to recognize the faces of a group of guests initially in order to greet them, but in its interaction with the guests it may find a new guest whom it does not recognize. It can then take some pictures and learn to recognize him/her and associate him/her with a name obtained by asking the guest. In this way, the robot can greet the new guest next time in a personalized manner.

4. The definition does not give details about knowledge or its representation in the knowledge base (KB) because of our limited understanding. Current papers use only one or two specific types of knowledge suitable for their proposed techniques. The problem of knowledge representation is still an active research topic. The definition also does not specify how to maintain and update the knowledge base. For a particular application, one can design a KB based on the application's needs. We will discuss some possible components of the KB below.

5. The definition indicates that LML may require a systems approach that combines multiple learning algorithms and different knowledge representation schemes. It is not likely that a single learning algorithm is able to achieve the objective of LML.

6. There is still no generic LML system that is able to perform LML in all possible domains for all possible types of tasks; in fact, we are far from that. Unlike many machine learning algorithms such as SVM and deep learning, which can be applied to any learning task as long as the data is represented in the required format, current LML algorithms are still quite specific to some types of tasks and data.

2.1.2. System architecture of lifelong learning

Figure 1. The system architecture of lifelong machine learning

From the definition and the remarks, we can outline a general process of LML and an LML system architecture. Figure 1 illustrates the process and the architecture. Below, we first describe the key components of the system and then discuss the LML process. We note that this general architecture is for illustration purposes; not all existing systems use all the components or subcomponents. In fact, most current systems are much simpler.

1. Knowledge Base (KB): It mainly stores the previously learned knowledge. It has a few subcomponents:

(a) Past Information Store (PIS): It stores the information resulting from past learning, including the resulting models, patterns, or other forms of outcome. PIS may also involve sub-stores for information such as (1) the original data used in each previous task, (2) intermediate results from each previous task, and (3) the final model or patterns learned from each previous task. What information or knowledge should be retained depends on the learning task and the learning algorithm. For a particular system, the user needs to decide what to retain in order to help future learning.

(b) Meta-Knowledge Miner (MKM): It performs meta-mining of the knowledge in the PIS and in the Meta-Knowledge Store (see below). We call this meta-mining because it mines higher-level knowledge from the saved knowledge. The resulting knowledge is stored in the Meta-Knowledge Store. Multiple mining algorithms may be used to produce different types of results.

(c) Meta-Knowledge Store (MKS): It stores the knowledge mined or consolidated from the PIS (Past Information Store) and also from the MKS itself. Suitable knowledge representation schemes are needed for each application.

(d) Knowledge Reasoner (KR): It makes inferences based on the knowledge in the MKS and PIS to generate more knowledge. Most current systems do not have this subcomponent. However, with the advance of LML, this component will become increasingly important. Since current LML research is still in its infancy, as indicated above, none of the existing systems has all of these subcomponents.

2. Knowledge-Based Learner (KBL): For LML, it is necessary for the learner to be able to use prior knowledge in learning. We call such a learner a knowledge-based learner; it can leverage the knowledge in the KB to learn the new task. This component may have two subcomponents: (1) the Task Knowledge Miner (TKM), which makes use of the raw knowledge or information in the KB to mine or identify knowledge that is appropriate for the current task (this is needed because in some cases the KBL cannot use the raw knowledge in the KB directly but needs some task-specific and more general knowledge mined from the KB), and (2) the learner that can make use of the mined knowledge in learning.

3. Output: This is the learning result for the user, which can be a prediction model or classifier in supervised learning, clusters or topics in unsupervised learning, a policy in reinforcement learning, etc.

4. Task Manager (TM): It receives and manages the tasks that arrive at the system, handles task shifts, and presents the new learning task to the KBL in a lifelong manner.
2.1.3. Lifelong topic modeling

Topic models, such as LDA and pLSA, are unsupervised learning methods for discovering topics from a set of text documents. They have been applied to numerous applications, e.g., opinion mining, machine translation, word sense disambiguation, phrase extraction, and information retrieval. In general, topic models assume that each document discusses a set of topics, probabilistically a multinomial distribution over the set of topics, and that each topic is indicated by a set of topical words, probabilistically a multinomial distribution over the set of words. The two kinds of distributions are called the document-topic distribution and the topic-word distribution respectively. The intuition is that some words are more or less likely to be present given the topics of a document. For example, “sport” and “player” will appear more often in documents about sports; “rain” and “cloud” will appear more frequently in documents about weather. However, fully unsupervised topic models tend to generate many inscrutable topics. The main reason is that the objective functions of topic models are not always consistent with human judgment.

To deal with this problem, we can use any of the following three approaches:

• Inventing better topic models: This approach may work if a large number of documents is available. If the number of documents is small, regardless of how good the model is, it will not generate good topics, simply because topic models are unsupervised learning methods and insufficient data cannot provide reliable statistics for modeling. Some form of supervision or external information beyond the given documents is necessary.

• Asking users to provide prior domain knowledge: This approach asks the user or a domain expert to provide some prior domain knowledge. One form of knowledge can be must-links and cannot-links. A must-link states that two terms (or words) should belong to the same topic, e.g., price and cost. A cannot-link indicates that two terms should not be in the same topic, e.g., price and picture. Some existing knowledge-based topic models have used such prior domain knowledge to produce better topics. However, asking the user to provide prior knowledge is problematic in practice because the user may not know what knowledge to provide and wants the system to discover useful knowledge for him/her. It also makes the approach non-automatic.

• Using lifelong topic modeling: This approach incorporates LML in topic modeling. Instead of asking the user to provide prior knowledge, prior knowledge is learned and accumulated automatically in the modeling of previous tasks. For example, we can use the topics resulting from the modeling of previous tasks as the prior knowledge to help the new task. The approach works because of the observation that there is usually a great deal of sharing of concepts or topics across domains and tasks in natural language processing, e.g., in sentiment analysis. At the beginning, the KB is either empty or filled with knowledge from an external source such as WordNet. It grows with the results of incoming topic modeling tasks. Since all the tasks are about topic modeling, we use domains to distinguish the tasks: two topic modeling tasks are different if their corpus domains are different. The scope of a domain is quite general; a domain can be a category (e.g., sports), a product (e.g., camera), or an event (e.g., presidential election). We use T1, …, TN to denote the sequence of previous tasks, D1, …, DN to denote their corresponding data or corpora, and TN+1 to denote the new or current task with its data DN+1.
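A minimal sketch of this idea of reusing past topics as prior knowledge, under the simplifying assumption that a topic is just its list of top words: mining a must-link is reduced here to finding word pairs that appear together in the top words of topics from several previous domains (the LTM model described in the next subsection does this properly inside Gibbs sampling; the domains and word lists below are made up).

```python
from itertools import combinations
from collections import Counter

# Hypothetical p-topics accumulated from previous domains (top words only).
topic_base = {
    "camera": [["price", "cost", "expensive"], ["battery", "life", "hours"]],
    "phone":  [["price", "cost", "cheap"],     ["screen", "display", "size"]],
    "laptop": [["price", "cost", "value"],     ["battery", "charge", "life"]],
}

def mine_must_links(topic_base, min_domains=2):
    """Count in how many domains a word pair co-occurs in the top words of some topic."""
    support = Counter()
    for domain, topics in topic_base.items():
        pairs_in_domain = set()
        for top_words in topics:
            pairs_in_domain |= {tuple(sorted(p)) for p in combinations(top_words, 2)}
        support.update(pairs_in_domain)        # each domain is counted once per pair
    return [pair for pair, n in support.items() if n >= min_domains]

# Word pairs shared across domains become must-link prior knowledge
# for modeling the next (possibly small) domain.
print(mine_must_links(topic_base))   # e.g. ('cost', 'price'), ('battery', 'life'), ...
```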
2.1.4. LTM: a lifelong topic model

LTM (Lifelong Topic Model) was proposed by Chen and Liu. It works in the following lifelong setting: at a particular point in time, a set of N previous modeling tasks has been performed. From each past task/domain data (or document set) Di, a set of topics has been generated. Such topics are called prior topics (or p-topics for short). Topics from all past tasks are stored in the knowledge base S (known as the topic base). At a new time point, a new task, represented by a new domain document set DN+1, arrives for topic modeling; this is also called the current domain. LTM does not directly use the p-topics in S as knowledge to help its modeling. Instead, it mines must-links from S and uses the must-links as prior knowledge to help model inference for the (N + 1)th task. The process is dynamic and iterative. Once modeling on DN+1 is done, its resulting topics are added to S for future use.

LTM has two key characteristics:

• LTM's knowledge mining is targeted, meaning that it only mines useful knowledge from the relevant p-topics in S. To do this, LTM first performs topic modeling on DN+1 to find some initial topics, and then uses these topics to find similar p-topics in S. Those similar p-topics are used to mine must-links (knowledge), which are therefore more likely to be applicable and correct. These must-links are then used in the next iteration of modeling to guide the inference to generate more accurate topics.

• LTM is a fault-tolerant model, as it is able to deal with errors in automatically mined must-links. First, due to wrong topics (topics with many incoherent/wrong words or topics without a dominant semantic theme) in S, or due to mining errors, the words in a must-link may not belong to the same topic in general. Second, the words in a must-link may belong to the same topic in some domains but not in others because of domain diversity. Thus, to apply such knowledge in modeling, the model must deal with possible errors in must-links.

We now discuss the LTM model. Like many topic models, LTM uses Gibbs sampling for inference. Its graphical model is the same as LDA, but it has a very different sampler, which can incorporate prior knowledge and also handle errors in the knowledge as indicated above. The LTM system is illustrated in Figure 2.

Figure 2: The Lifelong Topic Model (LTM) system architecture

LTM works as follows: it first runs the Gibbs sampler of LTM for M iterations (or sweeps) to find a set of initial topics from DN+1 with no knowledge. It then makes another M Gibbs sampling sweeps. Before each of these new sweeps, it first mines a set of targeted must-links (knowledge) for every topic in the current topic set using the function TopicKnowledgeMiner, and then uses them to generate a new set of topics from DN+1. To distinguish these from p-topics, the new topics are called the current topics (or c-topics for short). We say that the mined must-links are targeted because they are mined based on the c-topics and are targeted at improving those topics. Note that, to make the algorithm more efficient, it is not necessary to mine knowledge in every sweep. Finally, LTM simply updates the knowledge base, which is easy since each task is from a distinct domain: the set of resulting topics is added to the knowledge base S for future use.

2.1.5. AMC: a lifelong topic model for small data

The LTM model needs a fairly large set of documents in order to generate reasonable initial topics to be used in finding similar past topics in the knowledge base to mine appropriate must-link knowledge. However, when the document set (or data) is very small, this approach does not work, because the initial modeling produces very poor topics, which cannot be used to find matching or similar past topics in the knowledge base to serve as prior knowledge. A new approach is th...


2.2. Classifiers

2.2.1. K-nearest neighbors

K-nearest neighbors (kNN) is one of the simplest supervised learning algorithms in machine learning, yet it is effective in some cases. During training, this algorithm does not learn anything from the training data (this is why it is classified as “lazy learning”); all calculations are done when it needs to predict the result for new data. K-nearest neighbors can be applied to both types of supervised learning problems, classification and regression. kNN is also called an instance-based or memory-based learning algorithm.

With kNN, in the classification problem, the label of a new data point (or the answer for a query point) is deduced directly from the k nearest data points in the training set. The label of a test point can be determined by majority voting among the nearest points, or it can be deduced by assigning a different weight to each of those closest points.

In the regression problem, the output of a data point equals the output of the nearest known data point (in the case of k = 1), or the weighted average of the outputs of the nearest points, or is given by some relationship based on the distances to those nearest points.

In short, kNN is an algorithm that finds the output of a new data point by relying only on the information of the k closest data points in the training set (its k neighbors), regardless of whether some of those closest points are noise. The figure below is an example of kNN with k = 1.

An example of applying 1-NN in a classification problem

The above example is a classification problem with 3 classes: red, blue, and green. Each test data point is labeled according to the color of its nearest training point.
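A small example of the behavior just described, assuming scikit-learn is available (the toy points are made up): with k = 1 a query copies the label of its single nearest training point, while a larger k uses majority voting, optionally weighted by distance.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D training points with three classes (0, 1, 2).
X_train = [[0, 0], [1, 0], [0, 1],      # class 0
           [5, 5], [6, 5], [5, 6],      # class 1
           [0, 5], [1, 6], [0, 6]]      # class 2
y_train = [0, 0, 0, 1, 1, 1, 2, 2, 2]

query = [[0.5, 0.4], [5.2, 5.1]]

# k = 1: copy the label of the single nearest neighbor.
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(knn1.predict(query))            # [0 1]

# k = 3 with distance weighting: closer neighbors count more in the vote.
knn3 = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_train, y_train)
print(knn3.predict(query))            # [0 1]
```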

2.2.2. Naïve Bayes

Consider a classification problem with C classes 1, 2, …, C. Suppose there is a data point x with d features (components).

We need to calculate the probability that this data point falls into class c; in other words, we need to calculate p(y = c | x), written briefly as p(c | x).

This expression gives the probability that the data point falls into each class. The class of the data point can then be determined by selecting the class with the highest probability:

c* = argmax_{c ∈ {1, …, C}} p(c | x)

This posterior probability is often difficult to calculate directly. Instead, Bayes' rule is usually applied:

p(c | x) = p(x | c) p(c) / p(x) ∝ p(x | c) p(c)

since the denominator p(x) does not depend on c and can be ignored when comparing classes. Here p(c) can be interpreted as the prior probability that a point falls into class c. This value can be calculated by Maximum Likelihood Estimation (MLE), i.e., the ratio of the number of data points belonging to this class to the total amount of data in the training set, or it can also be evaluated with MAP (Maximum A Posteriori) estimation.

The other component, p(x | c), the distribution of the data points in class c, is often difficult to calculate because x is a multi-dimensional random variable and a lot of training data would be needed to build this distribution. To make the calculation simpler, it is often assumed that the components (features) of x are conditionally independent given the class c:

p(x | c) = ∏_{i=1}^{d} p(x_i | c)

This assumption of independence between the dimensions of the data given c is rather strong, and we seldom find data whose dimensions are truly independent of each other. However, thanks to this simplicity, the Naive Bayes Classifier (NBC) is very fast to train and test, which makes it effective in large-scale problems.

In the training phase, the distributions p(c) and p(x_i | c) are determined from the training data. Determining these values can be based on MLE (Maximum Likelihood Estimation) or MAP (Maximum A Posteriori) estimation.

In the test phase, with a new data point x, its class is determined by: c = argmax_c p(c) p(x_1 | c) p(x_2 | c) … p(x_d | c). (7)

When d is large and the probabilities are small, the expression on the right-hand side of (7) becomes a very small number, and the computation may run into numerical underflow. To solve this, (7) is often rewritten in the equivalent form obtained by taking the logarithm of the right-hand side: c = argmax_c [log p(c) + log p(x_1 | c) + … + log p(x_d | c)].

This does not affect the result because log is a monotonically increasing function on the set of positive numbers.

The Gaussian Naive Bayes model is mainly used for data whose components are continuous variables. For each dimension i and class c, x_i is assumed to follow a normal distribution with mean mu_ci and variance sigma^2_ci: p(x_i | c) = N(x_i; mu_ci, sigma^2_ci).

The parameter set (mu_ci, sigma^2_ci) is determined by MLE, i.e. from the sample mean and sample variance of dimension i over the training points of class c.
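As a sketch of how these estimates are used in practice, the snippet below (assuming scikit-learn; the toy data are invented) fits a Gaussian Naive Bayes model, which estimates a mean and variance per class and per dimension by MLE, and classifies a new point using the log form of (7).

```python
# Gaussian Naive Bayes sketch: per-class mean and variance of each dimension are estimated by MLE.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X_train = np.array([[1.0, 2.0], [1.2, 1.8], [4.0, 5.0], [3.8, 5.2]])
y_train = np.array([0, 0, 1, 1])

gnb = GaussianNB()
gnb.fit(X_train, y_train)

print(gnb.theta_)                       # estimated means mu_ci for each class and dimension
print(gnb.predict_proba([[1.1, 2.1]]))  # posterior p(c | x) for a new point
print(gnb.predict([[1.1, 2.1]]))        # argmax over classes -> class 0
```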

Consider a probability distribution of a discrete variable x that can take n different values x_1, x_2, …, x_n.

Suppose that the probabilities of these values are p_i = p(x = x_i), with 0 <= p_i <= 1 and p_1 + p_2 + … + p_n = 1. Denote this distribution by p = (p_1, p_2, …, p_n). The entropy of this distribution is defined as H(p) = -(p_1 log p_1 + p_2 log p_2 + … + p_n log p_n), where log is the natural logarithm and 0 log 0 is taken to be 0 by convention.

Consider an example with n = 2 given in the following figure. In the case where p is the purest, that is, one of the two p_i equals 1 and the other equals 0, the entropy of this distribution is H(p) = 0.

When p is the most impure, i.e. both p_i = 0.5, the entropy function reaches its highest value.

In ID3, the weighted sum of the entropies at the leaf nodes after building the decision tree is considered the loss function of that tree. The weights are proportional to the number of data points assigned to each node. ID3 looks for a reasonable division (a reasonable order of choosing attributes) so that the final loss function is as small as possible. As mentioned, this is achieved by selecting, at each step, the attribute whose split decreases the entropy by the largest amount. The problem of building a decision tree with ID3 can therefore be divided into small problems; in each of them, we only need to choose the attribute that gives the best division. Each of these small problems corresponds to dividing the data at a non-leaf node, so we derive the calculation for each such node.

Consider a problem with C different classes. Suppose we are working with a non-leaf node whose data points form a set S with |S| = N elements. Suppose further that among these N data points, N_c points belong to class c, for c = 1, 2, …, C. The probability that a data point falls into class c is then approximately N_c/N (Maximum Likelihood Estimation). Thus, the entropy at this node is calculated by H(S) = -sum_c (N_c/N) log(N_c/N).

Next, assume the selected attribute is x. Based on x, the data points in S are divided into K child nodes S_1, S_2, …, S_K, with m_1, m_2, …, m_K points respectively.

We define H(x, S) as the weighted sum of the entropies of the child nodes, each computed with the entropy formula above: H(x, S) = sum_k (m_k/N) H(S_k). Using these weights is important because the nodes often contain different numbers of points.

Then, we define the information gain obtained by splitting on attribute x: G(x, S) = H(S) - H(x, S).

In ID3, at each node, the selected attribute is the one that maximizes the information gain: x* = argmax_x G(x, S).
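A small numeric sketch of these formulas (plain Python; the class counts are invented for illustration) shows how the entropy of a node and the information gain of a candidate split are computed.

```python
# Entropy and information gain as used by ID3; the counts below are invented for illustration.
import math

def entropy(counts):
    # H(S) = -sum_c (N_c / N) * log(N_c / N), skipping empty classes
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

# A parent node with 10 points: 6 of class A and 4 of class B
parent = [6, 4]

# Splitting on some attribute x gives two child nodes with class counts [6, 1] and [0, 3]
children = [[6, 1], [0, 3]]

h_parent = entropy(parent)
n = sum(parent)
h_split = sum(sum(ch) / n * entropy(ch) for ch in children)  # weighted sum of child entropies
gain = h_parent - h_split                                    # information gain G(x, S)

print(round(h_parent, 3), round(h_split, 3), round(gain, 3))
```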

2.2.3.3 Classification and Regression Trees (CART)

CART uses Gini impurity. Gini impurity is based on the squared class probabilities for each target in the node; its value reaches its minimum (zero) when all cases in the node fall into a single target category. Suppose y takes values in {1, 2, …, m} and let f(i, j) be the frequency of the value j in node i, that is, f(i, j) is the fraction of records with y = j assigned to node i. The Gini impurity is calculated as I_G(i) = 1 - sum_j f(i, j)^2.
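For comparison with entropy, a minimal sketch of the Gini calculation (plain Python; the class counts are invented) is:

```python
# Gini impurity of a node, 1 - sum_j f(i, j)^2, computed from invented class counts.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([10, 0]))  # 0.0  -> pure node, all records share one target value
print(gini([5, 5]))   # 0.5  -> maximally impure node for two classes
print(gini([6, 4]))   # 0.48
```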

In linear regression, we have a dependent variable y that can be modeled as a function of an independent variable x: y = f(x) + ϵ, in which ϵ is the irreducible error. Assume that the function f defines a linear relationship.

In other words, we need to find the parameters θ0 and θ1, which define the intercept and slope of the line respectively: f(x) = θ0 + θ1 x.
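A short sketch (NumPy; the sample data are invented) of estimating θ0 and θ1 by least squares:

```python
# Least-squares fit of y = theta0 + theta1 * x on a small invented sample.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])       # roughly y = 1 + 2x plus noise

X = np.column_stack([np.ones_like(x), x])      # design matrix with a column of ones for the intercept
theta, *_ = np.linalg.lstsq(X, y, rcond=None)  # theta = [theta0, theta1]

print(theta)  # approximately [1.1, 1.96]
```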

THE METHOD

Problem formulation

Let T_1, T_2, …, T_N be N previous tasks and let D_i be the dataset of T_i, for i = 1, 2, …, N. Let S be the knowledge base, which includes all knowledge and information from the N previous tasks; S is empty when N = 0.

Let T_{N+1} be a new task (called the current task) with its dataset D_{N+1}. The problem is to determine a set D_close of previous datasets, which contains the previous datasets D_i that are close to D_{N+1}, and then to use the part of the knowledge in S related to D_close for solving the current task T_{N+1}.

Assume that there exists a general representation space F in which all data from all domains are representable, and that the sets of values of all dimensions of F are discrete. Moreover, assume that there exists a similarity measure sim(x, y) for a pair of elements x, y of F. Let x be an element of F and X a subset of elements of F; sim(x, X) is defined as the maximum similarity between x and the elements of X (sim(x, X) = max {sim(x, y) | y in X}; "the complete link").
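A minimal sketch of this definition, assuming that the shared representation is a vector space and that sim(x, y) is the cosine measure used later in the experiments (the variable names are illustrative):

```python
# sim(x, X) = max over y in X of sim(x, y), here with cosine similarity as sim(x, y).
import numpy as np

def cosine(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def sim_to_set(x, X):
    # similarity of an element x to a dataset X, taken as the maximum pairwise similarity
    return max(cosine(x, y) for y in X)

# Toy feature vectors standing in for elements of the shared representation space F
D_i = [np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])]
x = np.array([1.0, 0.0, 0.9])
print(sim_to_set(x, D_i))
```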

The closeness of previous datasets to the current dataset

Assume that δ_i is the similarity threshold of the i-th previous dataset D_i, i = 1, 2, …, N. The similarity threshold δ_i is prior knowledge, determined based on the set of similarity values sim(x, y) computed between the elements of D_i.

Definition 1. Let x be an element belonging to the current dataset D_{N+1}. The i-th previous dataset D_i is called close in the similarity measure to x iff sim(x, D_i) >= δ_i.

Definition 2 (closeness in the similarity measure). The i-th previous dataset D_i is called close in the similarity measure to the dataset D_{N+1} iff the proportion of elements x of D_{N+1} to which D_i is close (in the sense of Definition 1) is not smaller than a given closeness threshold.

Assume that all data of the i-th previous dataset D_i are labelled and let L_i be its label set. Let θ_i = (θ_i1, θ_i2, …, θ_i|Li|) be the probability threshold vector of the dataset D_i, i = 1, 2, …, N. The probability threshold vector θ_i is prior knowledge, determined based on the set of posterior probability vectors {(prob(l_i1 | x), prob(l_i2 | x), …, prob(l_i|Li| | x)) | x in D_i}.

Definition 3. Let x be an element belonging to the current dataset D_{N+1}, and let θ_i be the probability threshold vector of the i-th previous dataset D_i, i = 1, 2, …, N. D_i is called close in the probability to x iff some posterior probability of x in D_i is greater than or equal to the corresponding probability threshold, i.e. there exists j in {1, 2, …, |L_i|} such that prob(l_ij | x) >= θ_ij.

Definition 4 (closeness in the probability). The i-th previous dataset D_i is called close in the probability to the dataset D_{N+1} iff the proportion of elements x of D_{N+1} to which D_i is close (in the sense of Definition 3) is not smaller than a given closeness threshold.

Assume that the i-th previous task T_i is a binary classification task with classifier m_i.

Definition 5. Let x be an element belonging to the current dataset D_{N+1}. The i-th previous dataset D_i is called close in the classifier m_i to x iff m_i(x) is positive.

Definition 6 (closeness in the classifier). The i-th previous dataset D_i is called close in the classifier m_i to the dataset D_{N+1} iff the proportion of elements x of D_{N+1} to which D_i is close (in the sense of Definition 5) is not smaller than a given closeness threshold.
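The classifier-based closeness of Definitions 5 and 6 can be sketched as follows, assuming that each m_i is a binary classifier trained to recognise elements of D_i and that, as in the experiments, a previous domain is regarded as close when at least a threshold fraction of the current elements is accepted; the helper names and the default threshold of 0.1 are illustrative assumptions.

```python
# Closeness of a previous dataset D_i to the current dataset via its binary classifier m_i
# (a sketch; the helper names and the 0.1 threshold are assumptions, not taken from the thesis).
import numpy as np
from sklearn.linear_model import LogisticRegression

def close_to_element(m_i, x):
    # Definition 5: D_i is close to x iff m_i(x) is positive
    return m_i.predict(x.reshape(1, -1))[0] == 1

def close_to_dataset(m_i, D_current, threshold=0.1):
    # Definition 6 (sketch): D_i is close to the current dataset when the fraction
    # of current elements accepted by m_i reaches the closeness threshold
    accepted = sum(close_to_element(m_i, x) for x in D_current)
    return accepted / len(D_current) >= threshold

# Toy usage: m_i separates elements of D_i (label 1) from other data (label 0)
X = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.0], [2.1, 1.9]])
y = np.array([0, 0, 1, 1])
m_i = LogisticRegression().fit(X, y)
D_current = np.array([[1.9, 2.2], [0.2, 0.1]])
print(close_to_dataset(m_i, D_current))
```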

The closeness of two datasets

Similarly, the closeness of two previous datasets is defined in the similarity measure, in the probability, and in the classifier: "D_i and D_j are called close" iff "D_i is close to D_j" and "D_j is close to D_i". {D_1, D_2, …, D_N} may then be clustered for later use.

Proposed model of lifelong topic modeling using close domain knowledge for multi-label classification

Figure 5 The lifelong topic model using close domain knowledge for multi-label classification

The framework for multi-label text classification is described in Figure 5. The process of the framework is as follows:

First, we use one of the three proposed approaches to find the close domains.

After that, the close domains are used to train the AMC model, in order to exploit the prior domain knowledge and to adjust the hidden-topic distribution on the current domain.

Then, the set of hidden topics from the close domains is used to build a new feature set for the texts. These features are expected to be better than those extracted from arbitrary domains.

Finally, a multi-label classifier is built to classify new documents. We use different classifiers to evaluate the effectiveness of the proposed method in exploiting prior knowledge from close domains. A simplified sketch of this pipeline is given below.
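In the sketch, the helpers find_close_domains, train_amc and topic_features are placeholders for the components described above, not actual library calls.

```python
# Skeleton of the proposed framework; every helper passed in is a placeholder for a component
# described in the text (close-domain finding, AMC topic modelling, topic-feature building).
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier

def lifelong_multilabel_pipeline(previous_domains, D_current, y_current,
                                 find_close_domains, train_amc, topic_features):
    # Step 1: select the previous domains that are close to the current one
    close_domains = find_close_domains(previous_domains, D_current)

    # Step 2: train the AMC topic model with prior knowledge from the close domains
    amc_model = train_amc(close_domains, D_current)

    # Step 3: build new features for the current documents from the hidden topics
    X_current = topic_features(amc_model, D_current)

    # Step 4: train a multi-label classifier (Binary Relevance approximated by one-vs-rest)
    clf = OneVsRestClassifier(RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1))
    clf.fit(X_current, y_current)
    return clf
```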

RESULTS AND DISCUSSIONS

The datasets

In our approaches, we focus on the impact of different domains on the model. We divided the original dataset with N+1 labels into four sub-datasets, namely D_1, D_2, D_3 and D_4, as follows:

- D_1 includes documents whose label set is a subset of the set of N labels.

- D_2 includes documents whose label set is exactly the set of N labels.

- D_3 includes documents whose label set is a subset of the full label set (N+1 labels).

- D_4 includes documents whose label set is exactly the full label set (N+1 labels).

We use D_1, D_2 and D_3 as the prior domains and D_4 as the current domain.

In our experiments, we use a dataset of more than 1000 hotel reviews, each of which may carry multiple labels from the label set {Location and price, Services, Facilities, Room standard, Food}. We divide the original dataset into five sub-datasets named D_1, D_2, D_3 (the three previous domains), D_4 (the current domain) and D_test (the dataset for testing); the details of these domains are given in the table below.

Dataset | Size | Description
D_1 | 400 | Documents whose label set belongs to (is a subset of) the set of 5 labels
D_2 | 400 | Documents whose label set is exactly the set of 5 labels
D_3 | 400 | Documents whose label set belongs to (is a subset of) the set of 5 labels
D_4 | 50 or 100 | Documents whose label set is exactly the set of 5 labels
D_test | 100 | Documents whose label set is exactly the set of 5 labels

Experimental scenarios

We ran several experiments with the different configurations listed below:

- We use two kinds of features for data representation: Term Frequency (TF) and TFIDF (Term Frequency and Inverse Document Frequency).

- For the approach of finding close domains using a similarity measure, we use the cosine measure.

- For the approach of finding close domains using classifiers, we use the state-of-the-art algorithms Naive Bayes and Logistic Regression.

- For the multi-label classifier, we use the Binary Relevance method with the core algorithms kNN, Decision Tree, Gaussian Process, Random Forest, MLP, AdaBoost and Gaussian Naive Bayes.

+ With kNN, we set n_neighbors = 5

+ With Random Forest, we set max_depth = 5, n_estimators = 10, max_features = 1
+ With MLP, we set alpha = 1

+ With the others, we use the default settings (a small configuration sketch, assuming scikit-learn implementations, is given after this list)

- For the threshold of the closeness between previous domains and the current domain, we set the threshold value to 0.1.
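A small sketch of these classifier configurations, assuming scikit-learn implementations wrapped in a one-vs-rest scheme to realise Binary Relevance (the exact wrapper is an assumption):

```python
# Base classifiers with the parameter settings listed above; unspecified parameters keep
# scikit-learn defaults. Binary Relevance is approximated here with OneVsRestClassifier.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB

base_classifiers = {
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(),
    "Gaussian Process": GaussianProcessClassifier(),
    "Random Forest": RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    "MLP": MLPClassifier(alpha=1),
    "AdaBoost": AdaBoostClassifier(),
    "Gaussian NB": GaussianNB(),
}

# One binary model per label: wrap each base classifier in a one-vs-rest scheme
multilabel_classifiers = {name: OneVsRestClassifier(clf) for name, clf in base_classifiers.items()}
```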

We performed four groups of experiments with different settings to evaluate the effectiveness of the proposed framework as follows:

- Experiment 1 (denoted OF - Original Features): runs the multi-label classifier on the original TF and TFIDF features, without prior close-domain knowledge.

- Experiment 2 (denoted CMP - the Close-domain finding Method with the Probability approach): runs the multi-label classifier on the original features combined with features from the close domains found by the probability approach.

- Experiment 3 (denoted CMS - the Close-domain finding Method with the Similarity approach): runs the multi-label classifier on the original features combined with features from the close domains found by the similarity-measure approach.

- Experiment 4 (denoted CMC - the Close-domain finding Method with the Classifier approach): runs the multi-label classifier on the original features combined with features from the close domains found by the classifier approach.

Experimental results and discussions

The tables below show the experimental results on the Vietnamese hotel review dataset using lifelong machine learning with different classification methods.

Table 2. The experimental results with 50 reviews in D_4, using kNN and Decision Tree as classification methods

Table 3. The experimental results with 50 reviews in D_4, using Random Forest, MLP, AdaBoost and Gaussian Naive Bayes as classification methods

(Each table reports Precision, Recall and F1 for every experiment, feature type and classifier.)

Table 4. The experimental results with 100 reviews in D_4, using kNN and Decision Tree as classification methods

Table 5. The experimental results with 100 reviews in D_4, using Random Forest, MLP, AdaBoost and Gaussian Naive Bayes as classification methods

The results of the experiments using lifelong machine learning on the Vietnamese hotel review dataset are shown in Table 2, Table 3, Table 4 and Table 5. In general, we can see that lifelong machine learning helps learn the current task better than not using it. It can be said that lifelong machine learning is a solution for learning a new task that has very little data.

In these experiments, although lifelong machine learning shows only a small improvement, we believe that this approach has great potential for using past knowledge to help learn new tasks with much larger gains. In the future, we will study this approach more deeply to obtain even better results.

