
Relation Extraction in Vietnamese Text via Piecewise Convolution Neural Network with Word-Level Attention


2018 5th NAFOSTED Conference on Information and Computer Science (NICS)

Van-Nhat Nguyen¹, Ha-Thanh Nguyen¹, Dinh-Hieu Vo¹, Le-Minh Nguyen²
¹ VNU University of Engineering and Technology
² Japan Advanced Institute of Science and Technology

Abstract— With the explosion of information technology, the Internet now contains enormous amounts of data, so the role of information extraction systems has become very important. Relation Extraction is a sub-task of Information Extraction that focuses on classifying the relationship between pairs of entities mentioned in text. In recent years, although many new methods have been introduced, Relation Extraction continues to receive attention from researchers, for languages in general and for Vietnamese in particular. Relation Extraction can be addressed in a variety of ways, including supervised, unsupervised, and semi-supervised learning methods. Recent studies on English have shown that deep learning approaches to Relation Extraction, in both the supervised and semi-supervised settings, achieve results superior to traditional non-deep-learning methods. However, research on Vietnamese is scarce, and in our review of the literature we found no published results of deep learning applied to Relation Extraction in Vietnamese. This research therefore studies the use of deep learning to solve the Relation Extraction task in Vietnamese. To solve the task, we propose and construct a deep learning model named Piecewise Convolution Neural Network with Word-Level Attention.

Keywords— Relation Extraction, deep learning, convolution neural network, attention mechanism

I. INTRODUCTION

With the explosion of information technology, the Internet now contains an enormous amount of data. According to internetlivestats' statistics, there have so far been over 1,868,000,000 websites; every second, over 2,683,187 emails are sent, along with over 66,423 Google searches, 8,003 Twitter posts, 840 Instagram photos, over 1,366 Tumblr posts, 3,090 Skype calls, over 73,450 YouTube video views, and 55,730 GB of Internet traffic, and these numbers continue to increase.

The data on the Internet contains enormous amounts of information, but most of it exists as unstructured text with much redundancy, which makes it difficult to analyze. Information extraction systems therefore play a very important role in extracting meaningful information from this data for analysis. Information Extraction (IE) is a field of natural language processing (NLP) concerned with extracting structured information (information that can readily be interpreted as typed data) from unstructured text. Information extracted by IE systems can be applied in a variety of areas, such as analyzing users' primary business trends, disease prevention, crime prevention, bioinformatics, stock analysis, etc. Moreover, knowledge bases (KBs) such as Freebase [1] or DBpedia [2] still need a great deal of additional knowledge, so IE systems can be used to expand them.

According to Jiang [3], the information extraction task emerged in the 1970s (DeJong's FRUMP program) but only began to attract attention when DARPA (Defense Advanced Research Projects Agency) initiated and sponsored the Message Understanding Conferences (MUC) in the 1990s. Information extraction is a larger task that involves several sub-tasks, such as Named Entity Recognition (NER), Relation Extraction (RE), and Event Extraction. These sub-tasks are closely interrelated; for example, NER can be considered a preprocessing step for the more complex Relation Extraction task.

Relation Extraction is the sub-task of information extraction that focuses on recognizing and classifying relationships between entities in sentences or text. Relationships produced by RE systems can be applied to many tasks, such as question-answering systems, biomedical text mining, and medical support. The task has therefore received great attention from researchers worldwide in recent years: much research has been presented at major conferences such as COLING, ACL, and SensEval, and Relation Extraction is also part of international knowledge-mining projects such as Automatic Content Extraction (ACE) and Global WordNet.

Although the number of research works is not large, the Relation Extraction task continues to receive attention from a large number of researchers around the world; in particular, it is a task with strong potential for deep learning. For English, recent work [4, 5, 6] has shown that applying deep learning to this task is more effective than traditional non-deep-learning methods. Relation Extraction has also been mentioned as part of information extraction for Vietnamese text. However, research on Relation Extraction for Vietnamese is still limited, and in our review of the literature, no results of deep learning applied to Relation Extraction in Vietnamese could be found.

The objective of this research is to investigate and propose a deep learning model for the Relation Extraction task in Vietnamese. To reach this goal, we survey methods for solving Relation Extraction and representative models for each method. Drawing on ideas from [4, 6, 7], we develop a deep learning model called Piecewise Convolution Neural Network with Word-Level Attention. To assess its effectiveness, we built a Vietnamese dataset and evaluated the proposed model on it.

II. RELATED WORKS

A. Simple CNN models

The simple convolutional neural network model [8] was the earliest work to use a CNN to learn features automatically instead of hand-crafting them. It first encodes the input sentence using word embeddings and lexical features, followed by a convolutional layer, a single neural network layer, and a softmax output layer that produces the probability distribution over all relation classes.

The convolution neural network model with a max-pooling layer [9] also uses a CNN to encode sentence-level features, but differs in applying a max-pooling layer over the output of the convolution layer. This was also the first work to use positional embeddings. The model additionally uses lexical-level features, such as information about the nouns in the sentence and their hypernyms in WordNet.

The convolutional neural network with multi-sized window kernels [10] builds on the results of Liu et al. [8] and Zeng et al. [9]. This model completely removes the lexical word features, enriching the representation of the input sentence and allowing the CNN to learn the necessary features by itself. Its architecture is similar to that of Zeng et al., consisting of word and positional embeddings followed by convolution and max-pooling; in addition, it combines convolutional kernels of varying window sizes to capture n-gram features at multiple ranges.

B. CNN with attention mechanism models

The attention-based convolutional neural network [4] uses word embeddings, positional embeddings, and part-of-speech features to construct the word vectors, followed by a convolutional layer and a max-pooling layer to obtain the sentence-level convolution feature. The attention weights are learned by the model itself as the correlation between each word in the sentence and the two entities, and an attention-based sentence-level context feature is computed as the weighted sum of the word vectors. The convolution feature vector and the two attention-based context feature vectors (one per entity) are concatenated before being passed through a multi-layer perceptron with softmax activation. With an F1-score of 85.9% on the SemEval-2010 Task 8 dataset [11], this model demonstrated the effectiveness of a deep learning model applied to the Relation Extraction task.

The multi-level attention CNN [6] is perhaps the best-performing model at present, with an F1-score of 88.0% on the SemEval-2010 Task 8 dataset [11]. Its biggest contribution is the combination of two attention-based layers: attention on the input layer and attention on the max-pooling layer. We can see that the attention mechanism has a positive effect on these models.

III. PROPOSED MODEL

To solve the Relation Extraction problem, this study proposes and builds a deep learning model called Piecewise Convolution Neural Network with Word-Level Attention. This section details the architecture of the model and the process of building it. The general architecture of our model is shown in Figure 1.

[Figure 1: General architecture of the model]

A. Input representation

Suppose the input sentence is a sequence of words $S = [w_1, w_2, \dots, w_n]$ of length $n$, and the two entities of the sentence are $e_1 = w_p$ and $e_2 = w_t$ ($p, t \in [1, n]$; $p \neq t$). Like the models mentioned above, we use word embeddings and positional embeddings to encode each word $w_i$ into a vector $w_i^M$.

First, the model uses word embeddings to capture the semantics of each word. Given a word-embedding matrix $E_V$ of size $|V| \times d_w$, where $V$ is the vocabulary and $d_w$ is the embedding dimension, each word $w_i$ is looked up in the embedding matrix to retrieve a vector $w_i^d \in \mathbb{R}^{d_w}$.

Next, the model uses positional embeddings to capture the distance between each word and the two entities. First, the relative distance of each word to each of the two entities is calculated. For a set of input sentences, the two resulting sets of relative positions are pre-trained separately with Word2Vec [12] to obtain two positional embedding matrices. The two relative distances of each word are then looked up in the respective matrices to retrieve two positional embedding vectors $w_{i,1}^p$ and $w_{i,2}^p$ of the same size $d_p$. Finally, the representation of the word is the concatenation of the three vectors and has size $d = d_w + 2 d_p$:

$$w_i^M = w_i^d \oplus w_{i,1}^p \oplus w_{i,2}^p$$

where $\oplus$ is the vector concatenation operator.
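As a concrete illustration of this encoding step, below is a minimal NumPy sketch. The vocabulary size, embedding dimensions, random lookup tables, and the distance-clipping range are illustrative assumptions for demonstration, not the trained matrices of the paper (which pre-trains both the word and positional embeddings with Word2Vec [12]).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): |V| = 5000, d_w = d_p = 100, distances clipped to +/-60.
d_w, d_p, max_dist = 100, 100, 60
E_word = rng.normal(size=(5000, d_w))              # |V| x d_w word-embedding matrix E_V
E_pos1 = rng.normal(size=(2 * max_dist + 1, d_p))  # positional embeddings w.r.t. entity 1
E_pos2 = rng.normal(size=(2 * max_dist + 1, d_p))  # positional embeddings w.r.t. entity 2

def encode_sentence(word_ids, p, t):
    """Return the matrix [w_1^M, ..., w_n^M], where
    w_i^M = w_i^d (+) w_{i,1}^p (+) w_{i,2}^p."""
    rows = []
    for i, wid in enumerate(word_ids):
        w_d = E_word[wid]                                    # semantic part w_i^d
        d1 = np.clip(i - p, -max_dist, max_dist) + max_dist  # shifted distance to e1
        d2 = np.clip(i - t, -max_dist, max_dist) + max_dist  # shifted distance to e2
        rows.append(np.concatenate([w_d, E_pos1[d1], E_pos2[d2]]))
    return np.stack(rows)                                    # shape (n, d_w + 2*d_p)

W_M = encode_sentence([4, 17, 250, 9, 1033], p=0, t=4)
print(W_M.shape)  # (5, 300)
```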
B. Word-level attention mechanism

According to the studies of Huang et al. [4] and Wang et al. [6], the words in a sentence carry different levels of importance for predicting the relationship between a pair of entities. For example, in "[Hoàng Văn Trà]e1 sinh ở xã Nghi Hưng, huyện Nghi Lộc, tỉnh [Nghệ An]e2" ("[Hoang Van Tra]e1 was born in Nghi Hung commune, Nghi Loc district, [Nghe An]e2 province"), the relationship between the two entities is "Hometown", and the word "sinh" ("born") is the most important clue for predicting it. We therefore need to train the model to focus its attention on words that carry such important information.

The research proposes a word-level attention mechanism (shown in Figure 2) in which every word representation vector is multiplied by an attention weight learned by the model.

[Figure 2: Word-level attention mechanism]

First, we concatenate the word embedding vector of each word with the word embedding vectors of the two entities; the resulting vector is called $h_i$:

$$h_i = w_i^d \oplus w_p^d \oplus w_t^d, \quad p, t \in [1, n],\ p \neq t$$

Next, $h_i$ is passed through a fully connected layer to compute the correlation $u_i$ between each word and the two entities:

$$u_i = W_u h_i + b_u$$

Finally, the attention weight $\alpha_i$ of each word is obtained by applying the softmax function over the correlation scores of the sentence:

$$\alpha_i = \frac{\exp(u_i)}{\sum_i \exp(u_i)}$$

After obtaining the attention weights, the new representation vector $x_i$ of each word is its old representation vector scaled by its attention weight:

$$x_i = \alpha_i w_i^M$$
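A minimal NumPy sketch of this attention computation follows. The sizes are illustrative, and parameterizing $W_u$ to produce a single scalar score $u_i$ per word is our reading of the formulas above, not the authors' released code.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_w, d = 5, 100, 300            # illustrative sizes; d = d_w + 2*d_p
W_M = rng.normal(size=(n, d))      # full word representations w_i^M
W_d = rng.normal(size=(n, d_w))    # word-embedding parts w_i^d only
p, t = 0, 4                        # entity positions e1 = w_p, e2 = w_t

W_u = rng.normal(size=(3 * d_w,)) * 0.01  # fully connected layer: one score per word
b_u = 0.0

# h_i = w_i^d (+) w_p^d (+) w_t^d : pair every word with both entity embeddings
H = np.concatenate([W_d, np.tile(W_d[p], (n, 1)), np.tile(W_d[t], (n, 1))], axis=1)

u = H @ W_u + b_u                  # correlation scores u_i = W_u h_i + b_u
alpha = np.exp(u - u.max())        # softmax over the sentence (max-shifted for stability)
alpha /= alpha.sum()

X = alpha[:, None] * W_M           # x_i = alpha_i * w_i^M
print(alpha.round(3), X.shape)
```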
C. The convolutional layer

Like other deep learning models for Relation Extraction, this model uses a convolutional layer, with a window size of 3, to capture tri-gram features. Assuming the convolution layer has $m$ filters, the output value of the $j$-th filter ($j \in [1, m]$) at each word $x_i$ ($i \in [1, n]$) is computed as the convolution of the filter matrix $W_{c_j} \in \mathbb{R}^{3 \times d}$ with the representation matrix of the three-word window $[x_{i-1}, x_i, x_{i+1}]$, followed by an activation function:

$$c_{ij} = \tanh\left(W_{c_j} [x_{i-1}, x_i, x_{i+1}]^T + b_{c_j}\right)$$

where $W_{c_j} \in \mathbb{R}^{3 \times d}$ is the filter matrix of the $j$-th filter and $T$ is the transpose operator. The representation of the sentence $S$ under filter $j$ is then the vector

$$S_j = [c_{1j}, c_{2j}, \dots, c_{nj}]$$

so with $m$ filters, the representation matrix of sentence $S$ is

$$S = [S_1, S_2, \dots, S_m]$$
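The tri-gram convolution can be sketched as follows. Zero-padding at the sentence boundaries is an assumption (the paper does not state how $x_0$ and $x_{n+1}$ are handled), and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m = 5, 300, 4                        # m filters; sizes are illustrative
X = rng.normal(size=(n, d))                # attention-weighted word vectors x_i
W_c = rng.normal(size=(m, 3 * d)) * 0.01   # each row: a flattened 3 x d filter matrix W_cj
b_c = np.zeros(m)

# Zero-pad so every position has a full tri-gram (boundary handling is an assumption).
X_pad = np.vstack([np.zeros((1, d)), X, np.zeros((1, d))])

S = np.empty((m, n))
for i in range(n):
    trigram = X_pad[i:i + 3].reshape(-1)    # concatenated window [x_{i-1}, x_i, x_{i+1}]
    S[:, i] = np.tanh(W_c @ trigram + b_c)  # c_ij = tanh(W_cj . trigram + b_cj)

print(S.shape)  # (m, n): one feature sequence S_j per filter
```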
D. Piecewise max-pooling layer

The purpose of the max-pooling layer is to retain the most salient value in the output of each filter of the convolution layer. This research uses the piecewise max-pooling layer of Zeng et al. [7]. The two entities in the sentence divide the representation $S_j$ of the sentence into three parts:

$$S_j = [s_{j1}, s_{j2}, s_{j3}]$$

The model takes the maximum value of each part, $p_j = [p_{j1}, p_{j2}, p_{j3}]$ with $p_{jk} = \max(s_{jk})$. Finally, the representation of sentence $S$ is the concatenation of the vectors $p_j$:

$$S^* = p_1 \oplus p_2 \oplus \dots \oplus p_m$$
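A sketch of piecewise max-pooling, under the assumption that the first two segments end at (and include) an entity position; the paper does not spell out the exact boundary convention.

```python
import numpy as np

def piecewise_max_pool(S, p, t):
    """Piecewise max-pooling over the (m, n) feature map S: split each
    filter's sequence into three segments at the entity positions and
    keep the maximum of each segment (p_jk = max(s_jk))."""
    lo, hi = sorted((p, t))
    segments = [S[:, :lo + 1], S[:, lo + 1:hi + 1], S[:, hi + 1:]]
    pooled = np.stack(
        [seg.max(axis=1) if seg.size else np.zeros(S.shape[0])  # guard empty edge segment
         for seg in segments],
        axis=1)                     # shape (m, 3): p_j = [p_j1, p_j2, p_j3]
    return pooled.reshape(-1)       # S* = p_1 (+) p_2 (+) ... (+) p_m, length 3*m

rng = np.random.default_rng(3)
S = rng.normal(size=(4, 9))         # m = 4 filters over a 9-word sentence
print(piecewise_max_pool(S, p=1, t=6).shape)  # (12,)
```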
E. Lexical features

The studies of Huang et al. [4] and Zeng et al. [9] show that information from the two entities and the words around them is very important. Based on these studies, this model also uses a lexical feature vector built from the entities and the two words around each of them. Supposing $w_p$ and $w_t$ ($p, t \in [1, n]$; $p \neq t$) are the word embeddings of the two entities, the lexical feature vector is computed as:

$$L = w_{p-1} \oplus w_p \oplus w_{p+1} \oplus w_{t-1} \oplus w_t \oplus w_{t+1}$$

F. Output

After obtaining the feature vector $S^*$ of the sentence and the lexical feature vector $L$, we concatenate them into a single vector $V_o = S^* \oplus L$. This vector is passed through a fully connected layer with a softmax activation function to obtain the output vector:

$$o = \mathrm{softmax}(W_o V_o + b_o) = [o_1, o_2, \dots, o_l]$$

where $l$ is the number of relation classes and $o_r$ ($r \in [1, l]$) is the probability that sentence $S$ expresses relation $r$. The relation predicted by the model for sentence $S$ is the one with the highest probability:

$$r_S = \arg\max(o)$$
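Putting the last two steps together, here is a minimal sketch of the lexical-feature concatenation and the softmax output layer; shapes, random weights, and entity positions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
d_w, m, l = 100, 4, 4                  # l relation classes; sizes are illustrative
E_sent = rng.normal(size=(9, d_w))     # word embeddings of a 9-word sentence
S_star = rng.normal(size=(3 * m,))     # piecewise-pooled sentence vector S*
p, t = 1, 6                            # entity positions (not at the sentence edges)

# L = w_{p-1} (+) w_p (+) w_{p+1} (+) w_{t-1} (+) w_t (+) w_{t+1}
L = np.concatenate([E_sent[p - 1], E_sent[p], E_sent[p + 1],
                    E_sent[t - 1], E_sent[t], E_sent[t + 1]])

V_o = np.concatenate([S_star, L])      # V_o = S* (+) L

W_o = rng.normal(size=(l, V_o.size)) * 0.01
b_o = np.zeros(l)
z = W_o @ V_o + b_o
o = np.exp(z - z.max()); o /= o.sum()  # o = softmax(W_o V_o + b_o)

r_S = int(np.argmax(o))                # r_S = argmax(o): predicted relation index
print(o.round(3), r_S)
```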
IV. EXPERIMENTS

A. Dataset and evaluation metrics

The dataset of this study was collected automatically from Vietnamese Wikipedia pages about people¹. Based on the parameters (living_place, occupation), we extracted "human-place" and "human-occupation" entity pairs. Next, from the pages of the human entities, we analyzed all sentences containing the specified entity pairs and automatically assigned them a label. For example, human-occupation pairs were assigned the Occupation label, and sentences containing keywords such as "born" and "home" were assigned the Hometown label. Finally, the data was reviewed and re-labeled manually to avoid mistakes. The dataset includes 1,716 sentences with three relations and one Other relation, described in detail in Table I.

¹ https://vi.wikipedia.org/wiki/Thể_loại:Danh_mục_người_theo_tham_số

Table I: Number of sentences in the dataset

| Relation name | Number of sentences |
| Hometown      | 358                 |
| Workplace     | 632                 |
| Occupation    | 435                 |
| Other         | 291                 |
| Total         | 1,716               |

We evaluate the model using the macro-averaged F1 score over the relations (excluding Other), using k-fold cross-validation with k = 5.

B. Experimental setup

For word embeddings, the study uses a Word2Vec model pre-trained on a Vietnamese Wikipedia corpus. For training, the study uses the Adam optimizer with a learning rate of 0.01 and early stopping. The remaining parameters are described in Table II.

Table II: The list of the model's parameters

| Parameter                         | Value |
| Word embedding size (d_w)         | 100   |
| Position embedding size (d_p)     | 100   |
| Number of convolution filters (m) | 10    |
| Convolution window size           | 3     |
| Batch size                        | 100   |
| Number of epochs                  |       |
| Patience                          |       |
| Learning rate                     | 0.01  |

C. Experimental results

With the dataset and parameters described above, the proposed model achieved a macro-averaged F1-score of 94.89%. In addition, we evaluated different designs of the neural network to assess the contribution of three techniques: lexical features, piecewise max-pooling, and word-level attention. The results are shown in Table III.

Table III: Experimental results of the designs

| Lexical Feature | Piecewise Max-pooling | Word-level Attention | F1 Score |
|                 |                       |                      | 90.77    |
| ✓               |                       |                      | 94.12    |
|                 | ✓                     |                      | 92.65    |
|                 |                       | ✓                    | 91.38    |
| ✓               | ✓                     | ✓                    | 94.89    |

Based on Table III, we can see that the combination of the three techniques brings the best results on the dataset.

D. Error analysis

Analyzing the errors in the experiments, we found that our model has the general weakness of machine learning algorithms: data dependence. Most of the errors in our experiments are related to the label Other. The first reason is that the number of examples with this label in our dataset is small, only 16.96%. The second reason is the ambiguity of the labels: sentences belonging to the labels Hometown, Workplace, and Occupation can be wrongly classified as Other, and vice versa. The third reason is that the label Other has no specific features. For example, for "Từ năm 1967 đến năm 1968, [Thanh Tuyền]e1 hát song ca [ca sĩ]e2 Chế Linh…" ("From 1967 to 1968, [Thanh Tuyen]e1 sang duets with [singer]e2 Che Linh…"), our model predicted the label Occupation, but the true label is Other. In this case, our model had not learned the characteristics of sentences with the label Other, so it relied on the cues "Thanh Tuyen", "Che Linh", and "singer" to classify the sentence as Occupation.

As we can see, some problems remain in our model. These issues will continue to be addressed in subsequent studies.

V. CONCLUSION AND FUTURE WORK

In this study, we reviewed a number of deep learning models for solving Relation Extraction problems in English and proposed a Relation Extraction model that works with Vietnamese text. In our experiments, we constructed a Vietnamese dataset collected from Vietnamese Wikipedia and obtained an F1-score of 94.89% on this dataset. However, the experimental data and the number of relations in this study are limited and focus only on a specific domain related to people. In the future, we will continue to improve the quality of the data. The model obtained in this study can also be used in the labeling process for additional data. We will also apply this research result to specific domains such as legal engineering.

ACKNOWLEDGEMENT

This work has been supported by Vietnam National University, Hanoi (VNU), under Project No. QG.16.91.

REFERENCES

[1] K. Bollacker et al., "Freebase: a collaboratively created graph database for structuring human knowledge," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, ACM, 2008, pp. 1247-1250.
[2] S. Auer et al., "DBpedia: A nucleus for a web of open data," in The Semantic Web, Springer, Berlin, Heidelberg, 2007, pp. 722-735.
[3] J. Jiang, "Information extraction from text," in Mining Text Data, Springer, Boston, MA, 2012, pp. 11-41.
[4] X. Huang et al., "Attention-based convolutional neural network for semantic relation extraction," in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2016, pp. 2526-2536.
[5] Y. Y. Huang and W. Y. Wang, "Deep residual learning for weakly-supervised relation extraction," arXiv preprint arXiv:1707.08866, 2017.
[6] L. Wang et al., "Relation classification via multi-level attention CNNs," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1298-1307.
[7] D. Zeng et al., "Distant supervision for relation extraction via piecewise convolutional neural networks," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1753-1762.
[8] C. Liu et al., "Convolution neural network for relation extraction," in International Conference on Advanced Data Mining and Applications, Springer, Berlin, Heidelberg, 2013, pp. 231-242.
[9] D. Zeng et al., "Relation classification via convolutional deep neural network," in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 2335-2344.
[10] T. H. Nguyen and R. Grishman, "Relation extraction: Perspective from convolutional neural networks," in Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, 2015, pp. 39-48.
[11] I. Hendrickx et al., "SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals," in Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, Association for Computational Linguistics, 2009, pp. 94-99.
[12] T. Mikolov et al., "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111-3119.
