Improving Intent Extraction Using Ensemble Neural Network

2019 19th International Symposium on Communications and Information Technologies (ISCIT)

1st Thai-Le Luong, Faculty of Information Technology, University of Transport and Communications, Hanoi, Vietnam, luongthaile80@utc.edu.vn
2nd Nhu-Thuat Tran, Faculty of Information Technology, VNU University of Engineering and Technology, Hanoi, Vietnam, thuattn@vnu.edu.vn
3rd Xuan-Hieu Phan, Faculty of Information Technology, VNU University of Engineering and Technology, Hanoi, Vietnam, hieupx@vnu.edu.vn

Abstract—User intent extraction from social media texts aims at identifying the user intent keyword and its related information. This topic has attracted considerable research owing to its various applications in online marketing, e-commerce and business services. One line of work models the problem as a sequence labeling task and applies state-of-the-art sequential tagging models such as BiLSTM [12] and BiLSTM-CRFs [12]. In this paper, we take a further step and enhance intent extraction results based on tri-training [23] and ensemble learning [2]. Specifically, we simultaneously use three BiLSTM-CRFs models, each differing from the others in the type of word embeddings, and apply a majority voting scheme over their predicted labels when decoding the final labels. Extensive experiments on data from three domains, Real Estate, Tourism and Transportation, show that our proposed methods outperform the single-model approach.

Index Terms—intent mining, intent identification, tri-training, ensemble, information extraction

I. INTRODUCTION

In recent years, someone looking for a restaurant, a place of entertainment or even an apartment often goes straight to a forum or a social network to share his/her intent and preferences, hoping to get recommendations from others. These data can bring substantial benefits to businesses if there is a mechanism to automatically understand and extract user intents as soon as they appear on social media in the form of posts or comments. Most earlier research focused on classifying user intents into predefined categories using various approaches [6], [9], [10]. Only a minority of studies attempt to exploit the semantic or linguistic features of online texts to understand user intention more deeply [3], [4], [13], [15].

In our previous paper [15], we formulated the intent extraction task as a sequential labeling problem. For a particular intent domain, we attempted to extract the intent head and the intent properties. Terminologically, the intent head comprises the intent keyword and the intent object, while the intent properties include all possible constraints or preferences such as brand, price, date and so on. In this paper, we follow this idea to extract user intent from Vietnamese online texts (i.e., posts and comments). For example, for a post in the Tourism domain like “Our family is going to Da Nang from 14/6 to 18/6, we have 5 adults and child (1-year-old), could you recommend us the hotel, the best places to visit there and the total cost is about 20 million dong. Tks. Phone number: 0913 456 233", the intent extraction model will produce the following output: intent-keyword = “is going to"; intent properties = {destination = “Da Nang"; agenda = “from 14/6 to 18/6"; number of people = “5 adults and child"; price = “20 million dong"; contact = “0913 456 233"}.
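To make this concrete, such an output can be viewed as a small structured record. The sketch below represents it as a plain Python dict; the layout and key names are our own illustration, not a format fixed by the paper:

```python
# Illustrative structured output for the Tourism post above.
# The dict layout and key names are hypothetical, chosen for readability.
extracted = {
    "intent_keyword": "is going to",
    "intent_properties": {
        "destination": "Da Nang",
        "agenda": "from 14/6 to 18/6",
        "number_of_people": "5 adults and child",
        "price": "20 million dong",
        "contact": "0913 456 233",
    },
}
```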
Previously, ensemble learning methods have shown promising results on classification problems [2] and sequential labeling problems [18], and we expect similar results when applying them to the intent extraction task. Therefore, we explore tri-training and ensemble learning for intent extraction in the context of deep neural networks. Because BiLSTM-CRFs (Bidirectional Long Short-Term Memory with Conditional Random Fields) has achieved remarkable results in sequence labeling [12], it is reasonable to explore this model and its variations to improve task performance. In particular, we propose a novel model consisting of three BiLSTM-CRFs components, based on the tri-training technique [23], [21] and ensemble learning [18]. In our approach, we train this model and apply a majority voting scheme over the three outputs when decoding the final label for each token. For the voting scheme to work as expected, the three BiLSTM-CRFs components need to be as diverse as possible. For this reason, we initialize the word embedding layer of each component with word embeddings trained by a different method, namely FastText [1], GloVe [20] and Word2Vec [16]. These word embeddings are fine-tuned during the training phase. Although this model improves the intent extraction task, as presented in Section V-D, it requires a large amount of time to train. Therefore, we further explore the idea of sharing layers: the three components can share the LSTM-based character encoding layer, the LSTM-based word layer or the CRFs decoding layer. The results of these explorations are discussed in Section V-D.

Contributions. Our contributions are: (a) We propose a novel method to improve intent extraction results based on ensemble learning and the tri-training technique in the context of deep neural networks, and we explore variations of the proposed architecture in order to reduce training time. (b) We explore the role of word embeddings in the intent extraction task on our collected dataset; we find that using each word embedding type independently is sometimes more effective than concatenating them. (c) We define tag/label sets for intent information in three specific domains, Real Estate, Transportation and Tourism, and perform an extensive evaluation of the proposed method against previous state-of-the-art methods on datasets from these domains.

II. RELATED WORKS

A. User intent understanding

To the best of our knowledge, before 2012 most research attempted to mine user goals by exploiting user queries and/or behaviors on online channels. The most popular approach during this period was to classify user intents into pre-defined intent categories based on keywords or personalized data [6], [9], [10]. In recent years, some works have focused on understanding user intent from user posts/comments, such as [11], [15], [17], [22], but their number remains modest. Among these previous studies, several are highly similar to our work in exploiting linguistic features and/or high-level features of user posts/comments. In 2010, X. Li [13] confirmed that determining the semantic intent of web queries involves not only identifying their semantic class but also understanding their semantic structure. They formally defined the semantic structure of noun phrase queries as comprising intent heads (IH) and intent modifiers (IM), in which an intent head is a query segment that corresponds to an attribute name of an intent class, while an intent modifier is a query segment that corresponds to an attribute value of some attribute name. In 2012, the study of M. Castellanos et al. [3] on mining user intent from social media comments came relatively close to ours. The authors tried to identify the intent phrase, like “would like to see" or “are planning a trip", and to extract intent attributes, like “Ages = 7; Date = June". While some entities could be extracted automatically by a CRFs method supported by rules, some intents had to be identified manually; this is the key difference from our end-to-end method.
B. Tri-training technique

Tri-training, proposed by Zhou et al. [23], is a classic method that reduces the bias of predictions on unlabeled data by utilizing the agreement of three independently trained models. In each round of tri-training, an unlabeled example is labeled for a classifier if the other two classifiers agree on the labeling, under certain conditions. In [21], Sebastian Ruder et al. proposed a novel method, called multi-task tri-training, that reduces the time and space complexity of classic tri-training. They chose two NLP tasks with different characteristics, a sequence prediction task and a classification task (POS tagging and sentiment analysis), to treat as multiple tasks in the model. After extensive experiments, they found that multi-task tri-training outperforms both traditional tri-training and recent alternatives on sentiment analysis, whereas classic tri-training remains superior on POS tagging. This is one of the reasons why we explore classic tri-training in the context of deep neural networks to deal with our sequence labeling task.
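To make the classic tri-training loop concrete, the sketch below shows one simplified round: an unlabeled example is pseudo-labeled for one model whenever the other two models agree on its label. This is our own schematic; the fit/predict interface is assumed, and the error-rate safeguards of the original algorithm [23] are omitted:

```python
# One simplified round of classic tri-training (after Zhou et al. [23]).
# `models` holds three classifiers with an assumed fit/predict interface;
# `labeled` is a list of (x, y) pairs, `unlabeled` a list of inputs.
def tri_training_round(models, labeled, unlabeled):
    for i, target in enumerate(models):
        peer_a, peer_b = (m for j, m in enumerate(models) if j != i)
        pseudo = []
        for x in unlabeled:
            y_a, y_b = peer_a.predict(x), peer_b.predict(x)
            if y_a == y_b:               # the two peers agree on the label
                pseudo.append((x, y_a))  # adopt it as a pseudo-label
        target.fit(labeled + pseudo)     # retrain on gold plus pseudo-labels
    return models
```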
C. Ensemble learning

An ensemble is a collection of models whose predictions are combined by weighted averaging or voting. In 2004, R. Caruana et al. presented a method for constructing ensembles from libraries of thousands of models and achieved promising results on classification tasks [2]. Rather than combining good and bad models in an ensemble, they used forward stepwise selection from the library to find a subset of models that, when averaged together, yield excellent performance. Nguyen and Guo [18] expected similar results when applying ensemble learning to structured learning problems. They presented Structured Learning Ensemble (SLE), a novel combination method for sequence predictions that incorporates correlations of label sequences. On both POS (part-of-speech) tagging and OCR (handwritten character recognition) tasks, SLE exhibited superior performance compared with the single best model, SVM-struct, as verified in their paper. Our work shares the idea of exploring ensemble techniques for sequence labeling, but we mainly focus on exploiting neural networks, which have recently achieved state-of-the-art results in sequential tagging.

III. INTENT HEADS AND INTENT PROPERTIES IN THREE PARTICULAR DOMAINS

As stated above, we chose three domains for extracting intent information: Real Estate, Transportation and Tourism. To extract the intent head, including the intent keyword and intent object, as well as the intent properties or constraints, we first had to survey our data collection carefully. We then built a set of 18 labels for the Real Estate domain, 17 labels for the Transportation domain and 15 labels for the Tourism domain. This work took much more time than we expected: because a user post usually contains a lot of surrounding information supporting the main intention, it is hard to decide which information needs to be revealed and which does not. After building the three label sets, we annotated the data with them. Some examples of tagged posts are presented in Figure 1. A pair of HTML-like tags is used to mark a word/phrase that represents an intent keyword or intent attribute; for example, we wrap the phrase "Cho thuê" in such a tag pair to indicate an intent keyword. In Figure 1, the intent keyword is marked in red and each type of intent attribute is marked with a different color.

Figure 1. Example of tagged posts.

IV. INTENT EXTRACTION APPROACH BASED ON TRI-TRAINING AND ENSEMBLE LEARNING

In this section, we describe our proposed model from the bottom layers up to the top layers. The proposed architecture consists of three BiLSTM-CRFs components, as depicted in Figure 2, each of which has three layers. The lowest layer is the input word embedding layer, followed by a BiLSTM layer; the BiLSTM layer produces the input to the third layer, the CRFs layer, which decodes labels for its own component.

Figure 2. Our proposed model based on the tri-training technique in the context of deep neural networks and ensemble learning.

A. Input word embeddings

The input layer of each of the three components is a vector representation of each individual word. The word vector representation consists of a character-based word representation and a pre-trained word representation. We integrate the character-based word representation into our proposed model because our data was collected from social media and discussion forums; such data is sensitive to the spelling of words, so two words may have exactly the same meaning but come with different spellings. One obvious example is “tr" and “triệu". From the pre-trained word representation perspective, initializing word embeddings with meaningful values produces better results than random initialization, as stated in [5]. The pre-trained word representations in our model, which are 100-dimensional dense vectors, are trained by the three techniques mentioned above, namely FastText, Word2Vec and GloVe. Their role is not only to improve the performance of their own component but also to make the three components diverge, resulting in better performance of the whole model. These pre-trained embeddings are fine-tuned during training. Mathematically, the i-th word has the following representation:

$w_i = [h_i^f;\ h_i^b;\ e_i^{\text{pretrained}}]$   (1)

where $h_i^f$ and $h_i^b$ are the forward and backward character-based representations of word $w_i$ (the outputs of the forward and backward char-LSTMs, respectively). Since the three components are independent, $h_i^f$ and $h_i^b$ are learned independently during the training phase. If word $w_i$ goes to the GloVe word embedding component, $e_i^{\text{pretrained}}$ is looked up from the GloVe word embedding lookup table, as described in [12]; the same holds for the remaining components.
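To make Eq. (1) concrete, here is a minimal PyTorch sketch of how one component could assemble a word representation: the last forward and backward states of a character BiLSTM (25 + 25 dimensions, per Section V-C) concatenated with a 100-dimensional pre-trained embedding. The class and argument names are our own; the paper's actual implementation follows Lample et al. [12].

```python
import torch
import torch.nn as nn

class WordRepresentation(nn.Module):
    """Sketch of Eq. (1): w_i = [h_i^f ; h_i^b ; e_i^pretrained]."""

    def __init__(self, n_chars, n_words, char_dim=25, char_hidden=25, word_dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # bidirectional character LSTM: 25 forward + 25 backward = 50 dims
        self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                 bidirectional=True, batch_first=True)
        # initialized from GloVe/FastText/Word2Vec and fine-tuned in training
        self.word_emb = nn.Embedding(n_words, word_dim)

    def forward(self, char_ids, word_id):
        # char_ids: (1, num_chars) character indices; word_id: (1,) word index
        chars = self.char_emb(char_ids)
        _, (h_n, _) = self.char_lstm(chars)          # h_n: (2, 1, char_hidden)
        h_f, h_b = h_n[0], h_n[1]                    # last fwd / bwd states
        e_pre = self.word_emb(word_id)               # pre-trained lookup
        return torch.cat([h_f, h_b, e_pre], dim=-1)  # 50 + 100 = 150 dims
```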
B. BiLSTM Encoding Layer

Each word $w_i$ in a sentence $(w_1, w_2, \ldots, w_n)$ of $n$ words is represented as described above. The forward LSTM layer generates a representation $h_i^l$ of its left context; similarly, the backward LSTM generates a representation $h_i^r$ of its right context. We follow the settings in [12], where the forward and backward LSTMs have different parameters. The output of the BiLSTM is

$h_i = [h_i^l;\ h_i^r]$   (2)

C. CRFs Decoding Layer

Instead of tagging each word independently, a CRFs decoding layer is added to jointly decode the labels of all words in a sentence. This reflects the fact that the data is not fully independent: each word depends on its neighbors. For example, in the intent extraction task, the three labels B-INT, B-OBJ and I-OBJ frequently come together, while an I-LOC cannot follow an I-INT. This idea has shown its efficiency for the NER task in [12]. In each component of our model, the CRFs layer takes the output of the BiLSTM encoding layer as input and decodes a label for each word.

D. Loss Function in Training Phase

Each BiLSTM-CRFs component calculates its own loss function as in formula (1) of [12]. The overall loss of the proposed model is

$\text{overall\_loss} = \sum_i \text{loss}_i$   (3)

where $\text{loss}_i$ is the loss of the i-th component.

E. Majority Voting Scheme over Outputs of Three BiLSTM-CRFs Components

Given the input $x = (x_1, x_2, \ldots, x_N)$, the outputs of the three components of our model are three label sequences, denoted $\{y^{(1)}, y^{(2)}, y^{(3)}\}$. The final output is constructed from these sequences as

$y = \{\text{majority}(y_1^{(1)}, y_1^{(2)}, y_1^{(3)}), \ldots, \text{majority}(y_N^{(1)}, y_N^{(2)}, y_N^{(3)})\}$   (4)

where $\{y_i^{(1)}, y_i^{(2)}, y_i^{(3)}\}$ are the output labels for token $x_i$ from the three components. If, for a specific token, the three components output three different labels, we choose the label of the component with the highest Viterbi score given by its CRFs layer during decoding. We tried other strategies, such as constantly choosing one of the three outputs, but they did not show any improvement.
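The decoding rule of Eq. (4), including the Viterbi-score tie-break, can be sketched as follows. This is our own illustration; `viterbi_scores` is assumed to hold each component's sentence-level CRF decoding score.

```python
from collections import Counter

def vote_labels(y1, y2, y3, viterbi_scores):
    """Token-wise majority vote over three predicted label sequences
    (Eq. (4)); a three-way disagreement is resolved by taking the label
    of the component with the highest CRF Viterbi score."""
    outputs = (y1, y2, y3)
    best = max(range(3), key=lambda k: viterbi_scores[k])
    final = []
    for i, labels in enumerate(zip(y1, y2, y3)):
        label, freq = Counter(labels).most_common(1)[0]
        if freq == 1:                 # all three components disagree
            label = outputs[best][i]
        final.append(label)
    return final
```

When at least two components agree, their label wins outright; the Viterbi tie-break only fires when all three components disagree on a token.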
F. Exploration of sharing layers

Although the model described above performs better on the intent extraction task than the single-model approach, it requires a large amount of training time. Therefore, we tried sharing the character-based word representation across components, as depicted in Figure 3, so that it is jointly supervised by all components. Despite reducing training time, this model performs slightly worse than the one above, as shown in Figures 4, 5 and 6. We further explored sharing other layers, such as the BiLSTM layer or the CRFs layer; however, sharing these top layers degrades performance on our data.

Figure 3. Ensemble model with a shared character LSTM layer.

V. EXPERIMENTAL EVALUATION

A. Data

As stated above, we chose three intent domains, Real Estate, Transportation and Tourism, for extracting user intention. The main reason for this choice is that internet users in Vietnam seem to share intentions about these three domains much more than about others, as found in our previous survey [14]; thus, data for these three domains is more diverse. We automatically crawled data from well-known forums, websites and public Facebook groups, such as webtretho.com/forum, dulich.vnexpress.net, batdongsan.com.vn and facebook.com/groups/xemaycuhanoi. A group of students annotated the data based on the three label sets mentioned in Section III. After careful cross-checking among these students' work to ensure annotation consistency, we obtained a collection of about 3000 annotated posts for each domain. These data were then divided into training, development and test sets with proportions of 60%, 20% and 20%, respectively.

B. Evaluation metric

For all experiments, precision, recall and F1-score at the segment (or chunk-based) level are adopted as the official task evaluation. Specifically, assume that the true segment sequence of an instance is $s = (s_1, s_2, \ldots, s_N)$ and the decoded segment sequence is $s' = (s'_1, s'_2, \ldots, s'_K)$. Then $s'_k$ is called a true positive if $s'_k \in s$. Precision and recall are the fractions of true positives among the total numbers of decoded and true segments, respectively. We report the F1-score, computed as $2 \cdot \text{precision} \cdot \text{recall} / (\text{precision} + \text{recall})$. In addition, the support is the number of true segments corresponding to each label in the test set. The average/total of precision, recall and F1-score is calculated as the weighted average over labels, where the weight of each label is its support.
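A small sketch of this chunk-based metric, assuming segments are represented as (start, end, label) tuples (the paper does not prescribe a concrete representation):

```python
def chunk_prf(gold_segments, pred_segments):
    """Chunk-based precision/recall/F1: a decoded segment is a true
    positive only if it also appears among the gold segments."""
    gold, pred = set(gold_segments), set(pred_segments)
    tp = len(gold & pred)                         # exact-match segments
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def support_weighted_average(scores, supports):
    """The reported avg/total: per-label scores weighted by support."""
    return sum(s * n for s, n in zip(scores, supports)) / sum(supports)
```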
C. Training Parameters

Pre-trained Word Embeddings: To the best of our knowledge, no pre-trained embeddings for Vietnamese online texts are publicly available. Therefore, in our experiments we treated the training dataset of each domain as a corpus for building the models that generate word embeddings. Specifically, we used the public libraries glove (https://pypi.org/project/glove/), fasttext (https://pypi.org/project/fasttext/) and gensim (https://pypi.org/project/gensim/), all with a window size of 7, to produce the GloVe, FastText and Word2Vec embeddings, respectively.

Character-based Word Representation: The character embeddings of every character in a word are fed in direct and reverse order to a forward and a backward LSTM. The character-derived embedding of a word is the concatenation of its forward and backward representations from this bidirectional LSTM [12]. In our model, each character embedding has dimension 25. The forward and backward character LSTMs also have dimension 25, resulting in a 50-dimensional character-based word representation.

BiLSTM Encoding Layer: Our model uses a single layer for the forward and backward LSTMs, each of dimension 100. We apply dropout to mitigate overfitting [8]: a dropout mask is applied to the final embedding layer just before the input to the BiLSTM encoding layer. In all of our experiments, the dropout rate was fixed at 0.5.

Optimization and Fine-tuning: We used Adam optimization [7] with learning rate 0.001, β1 = 0.9, β2 = 0.999 and gradient clipping of 10. The effectiveness of fine-tuning embeddings has been explored for sequential prediction problems [19]. In our model, all initial embeddings are fine-tuned, i.e., modified during the gradient updates of the neural network by back-propagation. Our implementation is largely based on that of Lample et al. [12].
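A minimal PyTorch sketch of this optimization setup, leaving the model itself abstract; reading the clipping value of 10 as gradient-norm clipping is our assumption:

```python
import torch
import torch.nn as nn

def make_optimizer(model: nn.Module) -> torch.optim.Optimizer:
    # Adam with the paper's settings: lr = 0.001, beta1 = 0.9, beta2 = 0.999
    return torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

def training_step(model, optimizer, overall_loss):
    optimizer.zero_grad()
    overall_loss.backward()          # back-propagate the summed loss, Eq. (3)
    # clip gradients at 10; interpreting this as norm clipping is an assumption
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
    optimizer.step()
```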
D. Experimental results and discussion

As our motivation is to explore ensemble and tri-training techniques for the intent extraction task, we built the following six models for each intent domain: (1), (2), (3) three BiLSTM-CRFs models proposed by Lample et al. [12] (GLOVE, FASTTEXT, WORD2VEC), whose word embeddings are initialized with GloVe, FastText and Word2Vec embeddings, respectively; (4) a single BiLSTM-CRFs model of Lample et al. [12] whose word embedding initialization is the concatenation of the GloVe, FastText and Word2Vec embeddings (3-EMBEDDINGS); (5) our proposed model based on the tri-training technique and ensemble learning, as presented in Figure 2; (6) the model depicted in Figure 3, in which the character BiLSTM layer is jointly learned across the three components of the network (SHARING CHAR-LAYER MODEL).

After running all of the above models for each intent domain, the F1-scores averaged over different runs on the test sets are presented in Figures 4, 5 and 6.

Figure 4. Average F1-score over runs for each model in the Tourism domain.
Figure 5. Average F1-score over runs for each model in the Transportation domain.
Figure 6. Average F1-score over runs for each model in the Real Estate domain.

In all three domains, our proposed models reached better results than the four single models. The biggest improvement was in the Transportation domain, where our proposed method achieved an F1-score 1.15% higher than the single BiLSTM-CRFs with GloVe word embedding initialization and nearly 3% higher than the single BiLSTM-CRFs with Word2Vec initialization. The ensemble model with the character BiLSTM layer shared by all components, however, showed smaller improvements than the model without sharing. Its best gains are an F1-score 0.88% higher than the single BiLSTM-CRFs with GloVe initialization in the Real Estate domain and 2.62% higher than the single BiLSTM-CRFs with Word2Vec initialization in the Transportation domain. Since social media data is generated massively on a daily basis, this model is worth exploring to save training time.

Among the four single models, BiLSTM-CRFs with GloVe word embedding initialization topped the list in terms of F1-score. Interestingly, when these single models are combined into the ensemble model, the components based on models with lower F1-scores (those initialized with FastText and Word2Vec embeddings) contribute positively and boost the overall results of the whole ensemble; each component in the proposed models can support its counterparts to reach better overall performance. We also found that concatenating the three word embedding types as the input to a single model is not always effective on our data: experiments show that it always beats the worst single model but performs worse than the best one, so it does not improve the intent extraction task.

The Real Estate domain yielded the lowest results in all experiments. One reason is that posts/comments in Real Estate are usually long and complicated (e.g., Figure 1), so there is much more information to extract and also much more noise to face compared to the other two domains.

Finally, Table I presents the best result among the three domains for extracting the intent head and intent properties, which we achieved when applying our proposed model to the Tourism domain. This is partly because the Tourism domain has the smallest number of labels compared to Real Estate and Transportation (15, 18 and 17 labels, respectively). Moreover, after carefully analyzing the data from the three domains, we found that the Tourism domain contains less noisy data, such as improper abbreviations and emoticons, than the two remaining domains.

Table I. The best chunk-based result achieved when applying our proposed model in the Tourism domain.

| Specific Label      | Precision | Recall | F1-score | Support |
|---------------------|-----------|--------|----------|---------|
| Intent              | 88.87     | 89.41  | 89.14    | 661     |
| Object (Obj)        | 71.93     | 86.01  | 78.34    | 143     |
| Brand               | 80.00     | 28.57  | 42.11    | 14      |
| Contact             | 92.45     | 92.45  | 92.45    | 106     |
| Context             | 67.07     | 64.71  | 65.87    | 85      |
| Description of Obj  | 42.06     | 48.18  | 44.92    | 110     |
| Destination         | 88.59     | 86.24  | 87.40    | 756     |
| Name of Accom       | 56.73     | 68.60  | 62.11    | 86      |
| Number of Objects   | 93.83     | 93.83  | 93.83    | 81      |
| Number of People    | 88.29     | 87.78  | 88.03    | 352     |
| Point of Departure  | 75.31     | 75.31  | 75.31    | 81      |
| Point of Time       | 88.26     | 91.81  | 90.00    | 794     |
| Price               | 70.41     | 72.56  | 71.47    | 164     |
| Time Period         | 87.14     | 90.15  | 88.62    | 203     |
| Transport           | 73.47     | 65.45  | 69.23    | 55      |
| avg/total           | 84.06     | 85.29  | 84.57    | 3691    |

VI. CONCLUSION

In this paper, we presented a novel approach to extracting user intention from social media texts, motivated by the tri-training technique and ensemble learning in the context of deep neural networks. In this approach, the outputs of three independent BiLSTM-CRFs components are aggregated into the final prediction through a majority voting scheme. To ensure the diversity of these BiLSTM-CRFs components, each of them uses word embeddings initialized by a different generation method, namely GloVe, FastText and Word2Vec. In all experiments, our proposed models achieve higher F1-scores than the single-model approaches proposed by Lample et al. [12]. Despite its better performance, one drawback of our method is its time complexity. Therefore, we explored a variation of the model in which the character-based word representation layer is shared and jointly learned across components; besides reducing training time, this model still achieves promising results. Overall, our proposed ensemble models are effective for the user intent extraction task.

REFERENCES

[1] P. Bojanowski et al., “Enriching word vectors with subword information", Transactions of the Association for Computational Linguistics, pp. 135-146, 2017.
[2] R. Caruana, A. Niculescu-Mizil, G. Crew and A. Ksikes, “Ensemble selection from libraries of models", In Proceedings of the 21st ICML, p. 18, 2004.
[3] M. Castellanos et al., “Intention insider: discovering people's intentions in the social channel", In Proceedings of the 15th ICEDT, pp. 614-617, 2012.
[4] Y.S. Chang et al., “Identifying user goals from Web search results", In IEEE/WIC/ACM International Conference, pp. 1038-1041, 2006.
[5] R. Collobert et al., “Natural language processing (almost) from scratch", Journal of Machine Learning Research, vol. 12, pp. 2493-2537, 2011.
[6] H.K. Dai, L. Zhao, Z. Nie, J.R. Wen, L. Wang, and Y. Li, “Detecting online commercial intention (OCI)", In Proceedings of the 15th WWW, pp. 829-837, ACM, 2006.
[7] D.P. Kingma and J. Ba, “Adam: A method for stochastic optimization", arXiv preprint arXiv:1412.6980, 2014.
[8] G.E. Hinton et al., “Improving neural networks by preventing co-adaptation of feature detectors", arXiv preprint arXiv:1207.0580, 2012.
[9] D.H. Hu, D. Shen, J.T. Sun, Q. Yang and Z. Chen, “Context-aware online commercial intention detection", In ACML, pp. 135-149, 2009.
[10] B.J. Jansen, D.L. Booth and A. Spink, “Determining the user intent of Web search engine queries", In Proceedings of the 16th WWW, pp. 1149-1150, ACM, 2007.
[11] N. Labidi, T. Chaari, and R. Bouaziz, “An NLP-based ontology population for intentional structure", In International Conference on Intelligent Systems Design and Applications, pp. 900-910, 2016.
[12] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami and C. Dyer, “Neural architectures for named entity recognition", arXiv preprint arXiv:1603.01360, 2016.
[13] X. Li, “Understanding the semantic structure of noun phrase queries", In Proceedings of the 48th Annual Meeting of the ACL, pp. 1337-1345, 2010.
[14] Th.L. Luong, Qu.T. Truong, H.Tr. Dang and X.H. Phan, “Domain identification for intention posts on online social media", In Proceedings of SoICT, pp. 52-57, 2016.
[15] Th.L. Luong, M.S. Cao, D.T. Le and X.H. Phan, “Intent extraction from social media texts using sequential segmentation and deep learning models", In Proceedings of the 9th KSE, pp. 215-220, 2017.
[16] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space", arXiv preprint arXiv:1301.3781, 2013.
[17] X.B. Ngo, C.L. Le, and M.Ph. Tu, “Cross-domain intention detection in discussion forums", In Proceedings of the 8th SoICT, pp. 173-180, 2017.
[18] N. Nguyen and Y. Guo, “Comparisons of sequence labeling algorithms and extensions", In Proceedings of the 24th ICML, pp. 681-688, 2007.
[19] N. Peng and M. Dredze, “Named entity recognition for Chinese social media with jointly trained embeddings", In Proceedings of EMNLP, pp. 548-554, 2015.
[20] J. Pennington, R. Socher and C. Manning, “GloVe: Global vectors for word representation", In Proceedings of EMNLP, pp. 1532-1543, 2014.
[21] S. Ruder and B. Plank, “Strong baselines for neural semi-supervised learning under domain shift", arXiv preprint arXiv:1804.09530, 2018.
[22] J. Wang, G. Cong, W.X. Zhao and X. Li, “Mining user intents in Twitter: a semi-supervised approach to inferring intent categories for tweets", In Proceedings of the 29th AAAI, 2015.
[23] Z.H. Zhou and M. Li, “Tri-training: Exploiting unlabeled data using three classifiers", IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 11, pp. 1529-1541, 2005.
