THE MINISTRY OF EDUCATION AND TRAINING
THE UNIVERSITY OF DANANG

VO DUY THANH

AN APPLIED RESEARCH OF SEMI-SUPERVISED LEARNING TECHNOLOGY IN VIETNAMESE TEXT CLASSIFICATION FIELD

Major: COMPUTER SCIENCE
Code: 62 48 01 01

SUMMARY OF DISSERTATION FOR DOCTOR OF ENGINEERING

Da Nang - 2017

THE RESEARCH WAS ACCOMPLISHED AT THE UNIVERSITY OF DANANG

Advisors:
Assoc. Prof. Dr. Vo Trung Hung
Assoc. Prof. Dr. Doan Van Ban

Reviewer 1: Prof. Dr. Nguyen Mau Han
Reviewer 2: Prof. Dr. Phan Huy Khanh
Reviewer 3: Prof. Dr. Huynh Thi Thanh Binh

The dissertation was defended before the Dissertation Grading Council at The University of Danang level, at The University of Danang, on September 29th, 2017.

The dissertation can be found at:
- National Library of Vietnam;
- Learning Information Center, The University of Danang.

INTRODUCTION

Reasons for choosing the topic

The rapid development of science and information technology has given people fast and convenient ways to access information, such as electronic libraries, electronic portals and search applications. These services make it easier to exchange, update and search for information all over the world through the Internet. Automatic document classification is therefore considered an urgent problem, and it attracts many researchers. In this dissertation, the author investigates new, more effective methods for Vietnamese text classification based on semi-supervised learning.

Literature review

In computer science, semi-supervised learning is a class of machine learning techniques that uses both labeled and unlabeled data for training. Labeled data is usually scarcer than unlabeled data because labeling takes a lot of time. Many machine learning researchers have suggested that combining unlabeled data with a small quantity of labeled data can
considerably improve learning accuracy.

a Domestic literature review
b International literature review

Research target

The general target of this study is to investigate the application of semi-supervised learning to Vietnamese text classification.

Research objects and scope

Research objects:
- Semi-supervised learning technology;
- Classification and data clustering algorithms in structured and semi-structured data spaces;
- Vietnamese text classification as the main focus.

Research content

- Determining a function or method that classifies data into classes efficiently (usually two classes);
- Predicting the classes of unlabeled data;
- Examining the impact of the amount of unlabeled data on the results of the algorithm;
- Developing test software for Vietnamese text classification.

Research methodology

- Documentation methodology
- Empirical methodology
- Expert methodology

Main contributions of the dissertation

The main contributions of the dissertation are:
- Proposing a new text classification methodology based on the Geodesic model and graph theory;
- Proposing a solution for reducing the dimensionality of text vectors based on a Dendrogram;
- Building a data warehouse for Vietnamese text classification.

Dissertation structure

The main contents of the dissertation are presented in four chapters:
Chapter 1: Literature review
Chapter 2: Building a data warehouse
Chapter 3: Text classification based on Geodesic model
Chapter 4: Reducing the dimensionality of a vector based on Dendrogram

Chapter 1 LITERATURE REVIEW

1.1 Machine learning
1.1.1 Definition
1.1.2 Application of machine learning
1.2 Machine learning methodologies
1.2.1 Supervised learning
1.2.2 Unsupervised learning
1.2.3 Semi-supervised learning
1.2.4 Reinforcement learning
1.2.5 Deep learning
1.3 Overview of semi-supervised learning
1.3.1 Semi-supervised learning methodologies
- Expectation–maximization algorithm
- Transductive SVM
Figure 1.1 Maximum-margin hyperplane
-
Self-training algorithm
Figure 1.2 Visual performance of Self-training setup
- Co-training algorithm
Figure 1.3 Visual performance of Co-training setup
1.3.2 SVM supervised learning algorithm and SVM semi-supervised learning algorithm
- Introduction
- Support vector machine (SVM) algorithm
Figure 1.4 Example of binary classification
1.3.3 SVM in text classification
1.3.4 Semi-supervised SVM and website classification
1.3.5 Typical text classification algorithm
1.4 Text classification
1.4.1 Text
1.4.2 Displaying text by vector
Figure 1.5 Displaying model text by specific vectors
1.4.3 Text classification
a General model
Figure 1.6 General model of text classification system
b Classification steps
1.5 Proposed research
The general model for text classification is presented in the figure below.
Figure 1.7 Text classification model
Figure 1.8 The proposed classification model
1.6 Conclusion

Chapter 2 BUILDING A DATA WAREHOUSE

2.1 Introduction of data warehouse for Vietnamese text classification
a Introduction
b Purpose of the data warehouse for Vietnamese text classification
2.2 Overview of the data warehouse
2.2.1 Definition of the data warehouse
2.2.2 Characteristics of the data warehouse
2.2.3 Purpose of the data warehouse
2.2.4 Data warehouse architectures
a Basic data warehouse architecture
Figure 2.1 Architecture of a data warehouse
b Data warehouse architecture with a staging area
Figure 2.2 Architecture of a data warehouse with a staging area
Components of the data warehouse:
- Data Sources
- Staging Area
- Metadata
- Data Warehouse
- Data Marts
2.3 Requirements Analysis
2.3.1 Data warehouse building
Table 2.1 Downloaded raw data
Classification   Number of downloaded articles   Total size
Sport            1512                            363411 KB
Education        1231                            335561 KB
Law              1194                            175410 KB
International    1208                            255815 KB
Society          1152                            232633 KB
2.3.2 Data warehouse exploration
2.3.3 Data warehouse update
2.4 Data analysis and specification
2.5 Data warehouse building methodology
2.5.1 A proposed general model
Figure 2.3 The proposed general data warehouse model
2.5.2 Process of building a data warehouse
2.5.3 Process of text classification program
Figure 2.4 Text classification process
a Data preprocessing
b Text display
Vector space model
Figure 2.5 Vector model in 3D space
2.5.4 Text classification using Naïve Bayes algorithm
Table 2.2 Training data
Columns: Text, Confident, Creative, Ingenious, Enthusiasm
Confident: 44 12 14 35 29 10
Creative, Ingenious, Enthusiasm: 28 58 31 40 26 24 42 10 47 34 11 64 24 32
2.5.5 Formatting the data outputs in data warehouses
a Formatting the sample text
Class: Sport Society Society Sport Sport Society

3.3.1 Application development
3.3.2 Data preparation
Table 3.1 Number of files in the data warehouse
Type of documents   Training (labelled)   Training (unlabelled)   Test   Total
Sport               10                    613                     400    1023
Education           10                    604                     400    1014
Law                 10                    577                     400    987
International       10                    599                     400    1009
Society             10                    584                     400    994
3.3.3 Program deployment
- Training function
- Text classification function
3.3.4 Results

a The first experiment

Table 3.2 The first classification result with the use of the traditional SVM
Classified-label counts (Sport, Education, Law, International): 58 78 887 225 159 516 24 62 864 64 16 895 108 277 253
Classified-label counts (Society): 114 37 34 356
Actual label     Accuracy
Sport            86.7%
Education        51.0%
Law              87.5%
International    88.7%
Society          35.8%
Average rate of successful classification: 69.9%

Table 3.3 The first classification result with the use of the proposed SVM
Classified-label counts (Sport, Education, Law, International): 105 34 115 769 104 89 821 25 44 47 864 17 23 21 932 74 67 172 326
Classified-label counts (Society): 0 10 16 356
Actual label     Accuracy
Sport            75.2%
Education        81.0%
Law              87.5%
International    92.4%
Society          35.7%
Average rate of successful classification: 74.4%

The average rate of successful classification over all topics is 69.9% with the traditional SVM and 74.4% with the proposed method.

b The second experiment

Table 3.4 The second classification result with the use of the traditional SVM
Classified-label counts (Sport, Education, Law, International): 63 34 868 43 888 35 878 18 122 826 45 29 502 29
Classified-label counts (Society): 58 83 68 43 389
Actual label     Accuracy
Sport            84.8%
Education        87.6%
Law              89.0%
International    81.9%
Society          39.1%
Average rate of successful classification: 76.5%

Table 3.5 The second classification result with the use of the proposed SVM
Classified-label counts (Sport, Education, Law, International): 0 184 808 0 279 676 0 276 593 15 0 899 0 54 378
Classified-label counts (Society): 31 59 118 95 562
Actual label     Accuracy
Sport            79.0%
Education        66.7%
Law              60.1%
International    89.1%
Society          56.5%
Average rate of successful classification: 70.3%

c The third experiment

Table 3.6 The third classification result with the use of the traditional SVM
Classified-label counts (Sport, Education, Law, International): 295 721 0 234 763 22 291 674 19 990 51 83 557
Classified-label counts (Society): 17 0 303
Actual label     Accuracy
Sport            70.5%
Education        75.2%
Law              68.3%
International    98.1%
Society          30.5%
Average rate of successful classification: 68.5%

Table 3.7 The third classification result with the use of the proposed SVM
Classified-label counts (Sport, Education, Law, International): 126 147 750 117 18 879 81 41 804 33 242 720 74 261 208
Classified-label counts (Society): 0 23 14 451
Actual label     Accuracy
Sport            73.3%
Education        86.7%
Law              85.1%
International    71.4%
Society          45.3%
Average rate of successful classification: 72.4%

d The fourth experiment

Table 3.8 The fourth classification result with the use of the traditional SVM
Classified-label counts (Sport, Education, Law, International): 25 22 217 759 14 71 179 737 48 181 689 21 54 68 808 83 177 158
Classified-label counts (Society): 13 69 58 573
Actual label     Accuracy
Sport            74.2%
Education        72.7%
Law              69.8%
International    80.1%
Society          57.6%
Average rate of successful classification: 70.9%

Table 3.9 The fourth classification result with the use of the proposed SVM
Classified-label counts (Sport, Education, Law, International): 25 28 136 834 14 31 179 778 50 178 689 21 52 54 824 83 209 156
Classified-label counts (Society): 12 70 56 543
Actual label     Accuracy
Sport            81.5%
Education        76.7%
Law              69.8%
International    81.7%
Society          54.6%
Average rate of successful classification: 72.9%

e The fifth experiment

Table 3.10 The fifth classification result with the use of the traditional SVM
Classified-label counts (Sport, Education, Law, International): 34 19 194 776 14 75 179 725 46 184 692 12 41 54 805 11 83 241 156
Classified-label counts (Society): 21 65 97 503
Actual label     Accuracy
Sport            75.9%
Education        71.5%
Law              70.1%
International    79.8%
Society          50.6%
Average rate of successful classification: 69.6%

Table 3.11 The fifth classification result with the use of the proposed SVM
Classified-label counts (Sport, Education, Law, International): 26 43 218 736 121 42 799 17 35 98 795 27 134 792 49 51 168 153
Classified-label counts (Society): 52 42 56 573
Actual label     Accuracy
Sport            71.9%
Education        78.8%
Law              80.5%
International    78.5%
Society          57.6%
Average rate of successful classification: 73.5%

Figure 3.4 The average value and the variance of the classification rate based on the traditional SVM and the proposed method

The figure above shows the average value and the variance of the successful classification rate using the traditional SVM and the proposed method.

3.4 Conclusion

In this chapter, the author presented the results of text classification based on the proposed model, which combines the Geodesic model with the support vector machine. The Geodesic model uses the shortest correlation (the adjacency level between texts) to calculate the distance between two vectors. This Geodesic distance differs from the Euclidean distance and helps to increase the accuracy of automatic text classification. It also allows classification into many types instead of two (based on binary subclasses).

Chapter 4 REDUCING THE DIMENSIONALITY OF A VECTOR BASED ON DENDROGRAM

This chapter presents the proposed solution for reducing the dimensionality of the vectors that represent Vietnamese text, based on a Dendrogram and documents taken from Wikipedia. The dimensionality reduction is then applied
to Vietnamese text classification through experiments.

4.1 Introduction
4.1.1 Definition of Dendrogram
- Definition
Figure 4.1 Dendrogram
4.1.2 Proposed methodology
Figure 4.2 An example about Dendrogram
4.2 Building Dendrogram from Wikipedia data
4.2.1 Wikipedia processing algorithm
Figure 4.3 Diagram of Wikipedia data processing algorithm
4.2.2 Dictionary processing algorithm
Figure 4.4 Diagram of dictionary processing algorithm
4.2.3 P matrix calculation algorithm for common appearing frequency
4.2.4 Algorithm for building Dendrogram
4.2.5 Cluster analysis
a Wikipedia processing
b Dictionary
c Calculating the matrix of common appearing frequency
d Data organizing in program
4.2.6 Experiment
4.2.6.1 System structure
4.2.6.2 Functions
a Clustering function
Figure 4.5 Example of cutting Dendrogram, three groups are received
b Building classification model function
c Classification function
4.2.6.3 Results
Clustering the dictionary gives the following results.
Figure 4.6 The number of pairs of words according to the common appearing frequency
Figure 4.7 The number of groups based on clustering on Dendrogram
Cutting the dendrogram at 20% of the maximum distance gives sets of related words or synonyms, as follows:
Figure 4.8 The result of using dendrogram to clustering
Figure 4.9 Another example shows words related to music
Figure 4.10 An example of Dendrogram about words
Figure 4.11 An example shows words related to medicine
4.3 Applying words clustering into text classification
4.3.1 Input data
4.3.2 Experiment results
a Training model
Table 4.1 Training data, testing
Type of document   Training 1st time   2nd time   3rd time   4th time   5th time   Testing
Sport              15                  20         40         80         120        400
Education          15                  20         40         80         120        400
Law                15                  20         40         80         120        400
International      15                  20         40         80         120        400
Society            15                  20         40         80         120        400
Figure 4.12 The storage capacity of vectors depends on the number of words
Figure 4.13 Time of labeling of times training
b Text classification
c Accuracy of text classification
Figure 4.14 Average time for classifying text of times training
Figure 4.15 Classification rates of times training
d The average accuracy of text classification
Figure 4.16 The change of results according to the classification rate

Based on the figure above, reducing the dictionary can improve classification accuracy. If the reduction rate is chosen appropriately (from 30% to 70% of the initial vector space), the classification rate is higher than before, when words had not been clustered and reduced.

4.4 Conclusion

The results obtained with the proposed methodologies aim to enhance the quality of automatic Vietnamese text classification. The first methodology uses the Wikipedia encyclopedia and a Dendrogram to reduce the dimensionality of the vectors that represent Vietnamese text. The second methodology applies the reduced vectors to text classification. Experiments show that using the reduced vector space based on the Dendrogram and Wikipedia not only saves storage capacity and time for Vietnamese text classification but also preserves the classification accuracy: the classification rate is higher than without clustering.

A limitation of the proposed methodology is that clustering was tested only with the common appearing frequency of word pairs within one Wikipedia page, which can lead to semantic errors if that page contains too much information, for example a page that covers Sport, Law, Education and other topics. Future research will address this limitation.

CONCLUSION

Achieved results

In this dissertation, the author presents research results on Vietnamese text classification that combine semi-supervised learning with the support vector machine (SVM). The achieved results are as follows:
- Building a data warehouse for Vietnamese text classification;
- Proposing and testing the text classification methodology based on Geodesic distance;
-
Proposing and testing a methodology for reducing the dimensionality of the vectors that represent Vietnamese text, which increases processing speed while still ensuring classification accuracy.

Based on these results, the dissertation compared the proposed methodology based on Geodesic distance with the traditional SVM model on the same data set. The average classification rates of the two methodologies are not significantly different. However, the variance of the proposed method (± 2%) is smaller than that of the traditional SVM (± 4%), which suggests that the proposed method is more reliable than the traditional SVM for Vietnamese text classification.

Experiments show that using the vector space reduced by the Dendrogram and Wikipedia not only saves storage capacity and time for Vietnamese text classification but also preserves the correct classification rate compared with the unclustered case. At reduction rates of 30% to 70% of the initial vector space, the correct classification rate is higher than without clustering.

Limitation of the dissertation

- The text classification program has largely completed the proposed functions, such as helping users build a classification model for Vietnamese texts and automatically classifying new texts based on the established model. However, the initial data collection is still at the experimental stage.
- The dissertation does not use WORDNET or build a graph to consider the semantic correlation among words before building the feature vectors for text. This can reduce the quality of clustering.
- The dimensionality reduction was tested only with the common appearing frequency of word pairs within one Wikipedia page to form word groups, so it can produce wrong meanings if a Wikipedia page contains too much information, for example a page that includes information about Sport, Education, Law, International, Society and so on.
- The dissertation has been tested only on the support vector
machine (SVM).
- The dissertation has not yet compared different Dendrogram algorithms.

In the future, I will add several new functions and complete the program to enhance its effectiveness, and at the same time build a data warehouse large enough to classify text more accurately.

Proposal for future research

Text summarization is a research trend that currently attracts many scientists, especially for Vietnamese, where many issues remain to be investigated. Text summarization is therefore still an open research area. Within the limits of this dissertation, I suggest the following directions for further research on this topic:
- Continuing research on WORDNET, which supports looking up English semantics, and from that building a WORDNET for Vietnamese; alternatively, using a graph to optimize the interaction ability when creating feature vectors for text.
- To enhance the effectiveness of the semi-supervised learning model combined with text summarization, continuing research on Vietnamese word separation methods in order to increase the accuracy of extracting the main ideas from text content, and at the same time running different content compression tests to find higher compression rates that improve the accuracy of text classification with the proposed model.
- Testing with the common appearing frequency within one paragraph or one sentence.
- Testing with datasets other than Wikipedia, for example articles in Vietnamese online newspapers.
- Testing with other machine learning methodologies and comparing different Dendrogram algorithms.

LIST OF PUBLISHED SCIENTIFIC RESEARCH

Vo Duy Thanh, Vo Trung Hung, Pham Minh Tuan, Doan Van Ban, “Text classification based on semi-supervised learning”, Proceeding of the SoCPaR 2013, IEEE Catalog number CFP1395HART, ISBN 978-1-4799-3400-3/13/$31.00, pp. 238-242, 2013.

Vo Duy Thanh, Vo Trung Hung, Phạm Minh Tuan and Ho Khac Hung, “Text Classification Based On Manifold Semi-Supervised Support Vector Machine”, Proceeding of the ISDA 2014, 14th International Conference on Intelligent Systems Design and Applications, Okinawa, Japan, 27-29 November 2014, IEEJ catalog, ISSN: 2150-7996, pp. 13-19.

Pham Minh Tuan, Nguyen Thi Le Quyen, Vo Duy Thanh, Vo Trung Hung, “Vietnamese Documents Classification Based on Dendrogram and Wikipedia”, Proceedings of Asian Conference on Information Systems 2014, ACIS 2014, December 1-3, 2014, Nha Trang, Viet Nam, © 2014 by ACIS 2014, ISBN: 978-4-88686-089-7, pp. 247-253.

Vo Duy Thanh, Vo Trung Hung, Ho Khac Hung, Tran Quoc Huy, “Text Classification Based On SVM And Text Summarization”, International Journal of Engineering Research & Technology (IJERT), ISSN: 2278-0181, Vol. 4, Issue 02, February 2015, pp. 181-186.

Vo Trung Hung, Nguyen Thi Ngoc Anh, Ho Phan Hieu, Nguyen Ngoc Huyen Tran, Vo Duy Thanh, “Comparison of the documents based on vector model”, Journal of Science and Technology, the University of Danang, ISSN: 1859-1531, No. 3(112)-2017, pp. 105-109.