Named Entity Recognition for Vietnamese Real Estate Advertisements

2021 8th NAFOSTED Conference on Information and Computer Science (NICS)

Son Huynh, Khiem Le, Nhi Dang, Bao Le, Dang Huynh, Binh T. Nguyen, Trung T. Nguyen, Nhi Y. T. Ho
University of Science, Ho Chi Minh City, Vietnam
Vietnam National University, Ho Chi Minh City, Vietnam
Hung Thinh Corp., Ho Chi Minh City, Vietnam
Corresponding author: Binh T. Nguyen (VNU-HCM University of Science, Ho Chi Minh City, Vietnam) (Email: ngtbinh@hcmus.edu.vn)
978-1-6654-1001-4/21/$31.00 ©2021 IEEE

Abstract—With the booming development of the Internet and e-commerce, advertising has appeared in almost all areas of life, especially in the real estate domain. Understanding these advertising posts is necessary to capture the status of real estate transactions and the rent and sale prices of different kinds of properties in different areas. Motivated by that, we present the first manually annotated Vietnamese dataset in the real estate domain. Our dataset is annotated for the named entity recognition task with a large number of entity types. Compared with other Vietnamese NER datasets, our dataset contains the largest number of entities. We empirically investigate a strong baseline on our dataset using the API provided by the spaCy library, which comprises four main components: tokenization, embedding, encoding, and parsing. For the encoding, we conduct experiments with various encoders, including convolutions with Maxout activation (MaxoutWindowEncoder), convolutions with Mish activation (MishWindowEncoder), and bidirectional long short-term memory (BiLSTMEncoder). The experimental results show that the MishWindowEncoder gives the best performance in terms of micro F1-score (90.72%). Finally, we aim to publish our dataset later to contribute to the research community working on named entity recognition.

Keywords—Named Entity Recognition, embedding, Convolution, Skip Connection, LSTM

I. INTRODUCTION

Named Entity Recognition (NER), also called Entity Identification or Entity Extraction, has become an essential and fundamental task in Natural Language Processing (NLP). It involves identifying named entities in a text and classifying them into predefined categories. A named entity is a real-life object with proper identification that can be denoted with an appropriate name. Named entities can be a place, person, organization, time, object, or geographic entity. NER has been investigated for many years [1]; however, the majority of existing research relies on reasonably large annotated datasets, which are mainly available for popular languages such as English and French. This is a bottleneck for low-resource languages like Vietnamese, so it is worth creating a novel manually annotated NER dataset for Vietnamese to accelerate Vietnamese NER research.

Nowadays, real estate news and advertisements are posted daily, in massive volume, on many different real estate websites. Extracting key entities from these data sources leads to understanding the status of real estate transactions and the customer's demand. Detecting entities is also helpful for building downstream information extraction, text summarization, or chatbot systems in the real estate domain. Moreover, this helps store data more efficiently, facilitate data analysis, or build dashboards for data visualization.

Our contributions are summarized as follows:
1) We introduce and provide the community with the first manually annotated Vietnamese dataset in the real estate domain for the NER task. Our dataset is annotated with 16 different named entity types, more than in VLSP-2018 and the 10 in PhoNER-COVID19. Also, our dataset has the largest number of entities, consisting of over 53,000 entities.
2) We conduct experiments using strong baselines with the support of the spaCy library and empirically investigate three different encoders: MaxoutWindowEncoder, MishWindowEncoder, and BiLSTMEncoder. The experimental results show that MishWindowEncoder has the best performance in Recall, Precision, and F1-score.

II. RELATED WORK

Compared to other languages, data resources for Vietnamese NLP tasks are limited, specifically for the NER task. To the best of our knowledge, there are only two public datasets for the Vietnamese NER task. The first is the VLSP-2018 NER dataset [2], an extension of the VLSP-2016 NER dataset with more data. This dataset recognizes generic entities (person names, organizations, and locations) in daily news articles. The second is the recent PhoNER-COVID19 [3], specific to the COVID-19 domain, which helps facilitate many types of research and downstream applications such as building question-answering systems for pandemic prevention tasks. In this work, we develop and release the first Vietnamese NER dataset in the real estate domain.

Existing research on Vietnamese NER approaches is also fairly limited; some techniques have been proposed using various learning models such as classifier voting [4] and CRF [5]. Vu et al. [6] propose a method that normalizes a tweet before feeding it to a learning model for NER in Vietnamese tweets. Thang et al. [7] combined BiLSTM and CRF and enhanced word embeddings with character-level information to achieve competitive results on the VLSP-2018 NER dataset. Quang et al. [8] proposed an online learning algorithm, MIRA, in combination with CRF and bootstrapping. Recently, Thinh et al. [3] investigated BiLSTM-CNN-CRF and the pre-trained language models XLM-R and PhoBERT, as well as the effect of automatic Vietnamese word segmentation on the Vietnamese NER task. The work most relevant to ours is that of Lien et al. [9], who built an information extraction system for Vietnamese online real estate advertisements using a rule-based approach.

III. OUR DATASET

This section presents how we crawl data from many sources on the Internet and then preprocess and annotate the raw data. The overall process is shown in Figure 1.

Fig. 1. Our Dataset Building Process.

A. Data Collection

For this study, we crawled real estate advertisement posts dated between August 2020 and September 2020 from three different real estate websites in Vietnam:
• propzy.vn
• nhadat247.com.vn
• batdongsan.com.vn

B. Data Processing

Before training models to identify named entities, we preprocess real estate posts to clean noisy data and bring them to a standard format. In detail, we first remove posts that do not contain clear and critical information about real estate, then normalize the Unicode text and the position of Vietnamese tone marks within words. Then, we preprocess each string with the following steps:
1) Remove meaningless characters in a string, such as \n and \r (ASCII control codes) and emoticons.
2) Split multiple joined words.
3) Fix a spaCy format issue by separating dots or commas attached to a word. However, we keep dots or commas attached to a number (like 12.5 or 1,000,000).
4) Replace multiple spaces with a single space.
Table I shows some examples for each case above.

TABLE I. SEVERAL CASE EXAMPLES FOR THE DATA PROCESSING PHASE
Raw Text                                  Preprocessed Text
-DT: \xa0195.35m2 \n -DTSD: 152.50m2      DT 195.35m2 DTSD 152.50m2
Đất Quận 6Đã lên thổ cư                   Đất Quận 6 Đã lên thổ cư
giá 12,89tỷ                               giá 12,89 tỷ
1tỷ700tr                                  1 tỷ 700 tr
thuộc tầng Đến với                        thuộc tầng Đến với

C. Data Annotation

We use Doccano [10] as the labeling tool for this dataset and meticulously define 16 entities containing the critical information of a real estate sale advertisement. Each entity type is described in Table II.

TABLE II. THE NAMED ENTITY DEFINITIONS FOR A REAL ESTATE MENTIONED IN ONE ADVERTISEMENT POST
Label                  Definition
district_name          The district name where the real estate is located.
place_name             The name of one specific location, e.g., a building, a shopping mall, or an airport.
transaction_type       The transaction type of the real estate advertisement post, including sell, buy, or rent.
property_certificate   The property certificate information of the real estate.
property_type          The property type of the real estate, e.g., home, apartment, or land.
phone                  The phone number of a real estate.
number_street_name     The street name or the house number with the street name.
area                   The area nearby the real estate.
distance               The distance, such as 10m, 20m, 300m, etc.
province_city          The name of the province or city where the real estate is located.
host_name              The host name of the real estate.
ward_name              The ward name where the real estate is located.
price                  The price related to the real estate mentioned in the advertisement.
direction              The house direction information of the real estate. Example: East or West.
front_road             The front-road information of the real estate.
email                  The contact email.

D. Data Partitions

After annotating the dataset, we have 3152 real estate advertisements as a golden dataset. We then split this dataset into training, validation, and testing sets with a ratio of 60%, 20%, and 20%, respectively. Statistics of our dataset are presented in Table III.

TABLE III. STATISTICS OF OUR DATASET
#    Entity Type            Num of Entities
1    district_name          3123
2    place_name             7471
3    transaction_type       3490
4    property_certificate   2515
5    property_type          10397
6    phone                  2629
7    number_street_name     5415
8    area                   5272
9    distance               3123
10   province_city          1065
11   host_name              2079
12   ward_name              1254
13   price                  3959
14   direction              737
15   front_road             961
16   email                  25
     # Entities in total    53515
     # Sentences in total   3152

IV. METHODOLOGY

In this paper, we aim to investigate a named entity recognition system for Vietnamese real estate documents. This system can help users automatically parse the real estate information fields of an advertisement. Such a system is crucial and has become an indispensable tool in the real estate market. In what follows, we present the problem formulation and describe how we extract features from real estate documents and train our proposed model.

A. Problem Formulation

Given a Vietnamese real estate advertisement, our model detects the entity of each word from among the 16 entities defined in Section III-C.

B. Feature Extraction

This section describes the layers we use as feature extractors for real estate documents. First, we push the input data, which are annotated real estate documents, into a tokenizer that splits sentences into lists of words, including punctuation. A fixed number of UTF-8 byte characters is used for each word. We add padding tokens at the end of each list so that all lists have equal length. We then feed these numerical lists into the embedding layer named CharacterEmbed in spaCy [11], which vectorizes each sentence into an N × M matrix (where N is the number of words in the sentence and M is the number of dimensions representing a word). Next, we use one of the following four architectures to extract features from real estate advertisements:

1) MaxoutWindowEncoder: MaxoutWindowEncoder takes the embedding vectors as input. These features are pushed into a 1D convolution layer with a fixed window size and number of filters. A skip connection [12] then adds the embedding vector to the features extracted by the convolution layer. After that, the result goes through a Maxout activation function. Finally, the features are normalized by batch normalization. Batch normalization helps reduce overfitting and makes the model easier to converge, while the residual connection helps the model retain the information available before feature extraction by the convolution layer. More detail is shown in Figure 2.

2) MishWindowEncoder: This encoder has an architecture similar to MaxoutWindowEncoder. The difference is that it uses Mish [13] as the activation function instead of Maxout, as shown in Figure 2. According to Diganta Misra, in 75 experimental tasks with various models (DenseNet, Inception v3, Xception Net), Mish outperforms ReLU in 55 of 75 tasks and Swish in 53 of 75 tasks.

Fig. 2. The architecture of MaxoutWindowEncoder and MishWindowEncoder.

3) LSTMEncoder: LSTM (Long Short-Term Memory) networks [14] were first introduced by Hochreiter and Schmidhuber in 1997. This architecture is a particular form of RNN (Recurrent Neural Network), introduced by Rumelhart et al. in 1986 [15]. According to the authors, LSTM is designed to resolve the long-term dependency problem, in which a plain RNN cannot retain information over long sequences, and to avoid the vanishing and exploding gradient problems faced by RNNs. In our experiment, the embedding vectors are passed to an LSTM network whose number of hidden states equals N, the number of words in the input sentence, to obtain the extracted features.

4) BiLSTMEncoder: BiLSTM (Bidirectional Long Short-Term Memory) networks [16] are based on both LSTM [14] and BiRNN [17]. This architecture is similar to the LSTMEncoder in Section IV-B3. However, instead of using one LSTM network, this approach uses two LSTM networks: one processes the sequence forward, whereas the other processes it backward. BiLSTMs effectively increase the amount of information available to the network, giving the algorithm more context.

Fig. 3. The architecture of BiLSTMEncoder. If one LSTM network is removed, the architecture becomes LSTMEncoder.

C. Modeling

The spaCy API provides a powerful model for the named entity recognition task called TransitionBasedParser. According to the authors of spaCy, transition-based parsing is an approach to structured prediction where the task of predicting the structure is mapped to a series of state transitions (https://spacy.io/api/architectures). The authors claim that their transition-based parser is currently superior to and quicker than Stanford's CoreNLP [18]; more detail can be found on spaCy's blog (https://explosion.ai/blog/parsing-english-in-python). In this experiment, after using one of the four encoders described in Section IV-B to extract features from real estate documents, we feed these features into the TransitionBasedParser model to recognize the entity of each word in the text. Our end-to-end pipeline is shown in Figure 4.

Fig. 4. Our proposed data pipeline for the named entity recognition problem.

V. EXPERIMENT

We run all experiments on a computer with an Intel(R) Core(TM) i7 with 2 CPUs running at 2.4 GHz, 8 GB of RAM, and an Nvidia GeForce RTX 2080 Ti GPU with 11 GB of VRAM. In the data processing step, we use different Python packages, including NLTK (https://www.nltk.org/) and Regex (https://regexr.com/), as tools to clean data. Additionally, the Scikit-learn package (https://scikit-learn.org/stable/) is used to split our dataset. Finally, we use spaCy [11] as the toolkit for the named entity recognition problem.

A. Experiment Settings

We formulate the Real Estate Information NER task for Vietnamese with the BIO labeling scheme (short for beginning, inside, outside) presented by Ramshaw and Marcus in 1995 [19]. In our experiment, we use four different encoders with two widths, W = 64 and W = 300, where spaCy defines the width as the input and output width of the encoder (https://spacy.io/api/architectures). This gives eight combinations for measuring NER performance on real estate sale advertisements. Table IV lists the settings of our pipeline.

TABLE IV. THE HYPER-PARAMETERS OF OUR MODELING PIPELINE
Hyper-parameter   Value
Epoch             300
Learning rate     0.001
Batch size        512
Optimizer         Adam with beta1 = 0.9, beta2 = 0.99

B. Performance Metrics

In this experiment, we choose Precision, Recall, and F1-score as the key metrics for measuring the performance of our proposed models on each entity:

$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F_1 = 2 \times \frac{P \times R}{P + R}, \tag{1}$$

where P stands for Precision, R for Recall, and F_1 for the F1-score; TP denotes true positives, TN true negatives, and FP and FN false positives and false negatives. We then average each metric over all entities to calculate the overall performance of our proposed approach.

C. Results

We compare the performance of four different backbones: MaxoutWindowEncoder, MishWindowEncoder, LSTM, and BiLSTM. Each is combined with a width of 64 or 300, the values recommended by spaCy for the input and output width. The experimental results are summarized in Table V.

In general, the four feature extractors achieve good initial results in terms of Precision, Recall, and F1-score: even the lowest results on these measures are 0.8486, 0.8331, and 0.8450, respectively.

Next, we compare the four feature extractors using width = 64. The window encoders outperform LSTM and BiLSTM on all three measures. In particular, for the F1-score, MaxoutWindowEncoder and MishWindowEncoder reach 0.8775 and 0.8673, respectively, while LSTM and BiLSTM reach 0.8556 and 0.8450.

With width = 300, the window encoders again outperform LSTM and BiLSTM in Precision, Recall, and F1-score. In our experiments, the two window encoder types therefore always perform better than LSTM and BiLSTM. One possible reason is that the skip connection in the window encoders helps stabilize gradient updates: by connecting a previous layer directly to later layers and skipping some intermediate layers, it keeps information from being lost. Furthermore, the normalization layer allows faster and more stable training of deep neural networks by stabilizing the distribution of layer inputs during training; as a result, the model converges more easily. Additionally, using Mish [13] instead of Maxout as the activation function can improve performance; this holds at width = 300, whereas at width = 64 MaxoutWindowEncoder has the higher F1-score. A possible explanation is that the Mish activation function is bounded below, which yields a regularization effect and reduces overfitting. Our best model among the eight experiments uses MishWindowEncoder with width = 300 as the feature extractor; it achieves Precision, Recall, and F1-score of 0.8914, 0.9237, and 0.9072, respectively.

Finally, the per-entity results in terms of F1-score, Precision, and Recall are given in Tables VI, VII, and VIII. Performance on most entities is fairly stable. However, the entity ward_name is a challenge for our model: our best model (MishWindowEncoder with width = 300) obtains an F1-score of 0.7441 and a Recall of only 0.6738 on it. In other words, the model correctly identifies only 67.38% of the true ward_name entities. We aim to solve this issue in the future.

TABLE V. THE AVERAGE RESULTS OF DIFFERENT METHODS
Methods                     Precision   Recall   F1-score
MaxoutWindowEncoder W64     0.8623      0.8933   0.8775
LSTM W64                    0.8486      0.8628   0.8556
MishWindowEncoder W64       0.8677      0.8669   0.8673
BiLSTM W64                  0.8573      0.8331   0.8450
MaxoutWindowEncoder W300    0.8739      0.8871   0.8805
LSTM W300                   0.8649      0.8869   0.8758
MishWindowEncoder W300      0.8914      0.9237   0.9072
BiLSTM W300                 0.8524      0.8549   0.8535

VI. CONCLUSION

In this paper, we contribute a new dataset with 3152 advertisements for the Real Estate Information Named
Entity Recognition task for Vietnamese, covering 16 entity types, and evaluate eight model configurations to measure the initial performance in terms of Recall, Precision, and F1-score. We find that MishWindowEncoder outperforms all other techniques in all metrics. In the future, we aim to extend our results to different datasets and apply new approaches to improve the performance of the proposed algorithms.

ACKNOWLEDGMENTS

We want to thank the University of Science, Vietnam National University in Ho Chi Minh City, Hung Thinh Corp., and AISIA Research Lab in Vietnam for supporting us throughout this paper. This research is funded by Hung Thinh Corp. under grant number HTHT2021-18-01.

REFERENCES

[1] E. F. Tjong Kim Sang and F. De Meulder, "Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition," in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 2003, pp. 142–147. [Online]. Available: https://aclanthology.org/W03-0419
[2] H. Nguyen, Q. Ngo, L. Vu, V. Tran, and H. Nguyen, "VLSP shared task: Named entity recognition," Journal of Computer Science and Cybernetics, vol. 34, pp. 283–294, 01 2019.
[3] T. H. Truong, M. H. Dao, and D. Q. Nguyen, "COVID-19 Named Entity Recognition for Vietnamese," in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021.
[4] P. T. X. Thao, T. Q. Tri, D. Dien, and N. Collier, "Named entity recognition in Vietnamese using classifier voting," ACM Transactions on Asian Language Information Processing, vol. 6, no. 4, Dec. 2008. [Online]. Available: https://doi.org/10.1145/1316457.1316460
[5] H.-Q. Le, M.-V. Tran, N.-N. Bui, N.-C. Phan, and Q.-T. Ha, "An integrated approach using conditional random fields for named entity recognition and person property extraction in Vietnamese text," in 2011 International Conference on Asian Language Processing, 2011, pp. 115–118.
[6] V. Nguyen Hong, H. Nguyen, and V. Snasel, "Text normalization for named entity recognition in Vietnamese tweets," Computational Social Networks, vol. 3, 12 2016.
[7] L. Viet-Thang and L. K. Pham, "ZA-NER: Vietnamese named entity recognition at VLSP 2018 evaluation campaign," Proceedings of Vietnamese Speech and Language Processing (VLSP), 2018.
[8] Q. H. Pham, M.-L. Nguyen, B. T. Nguyen, and N. V. Cuong, "Semi-supervised learning for Vietnamese named entity recognition using online conditional random fields," in Proceedings of the Fifth Named Entity Workshop. Beijing, China: Association for Computational Linguistics, Jul. 2015, pp. 50–55. [Online]. Available: https://aclanthology.org/W15-3907
[9] L. V. Pham and S. B. Pham, "Information extraction for Vietnamese real estate advertisements," in 2012 Fourth International Conference on Knowledge and Systems Engineering, 2012, pp. 181–186.
[10] H. Nakayama, T. Kubo, J. Kamura, Y. Taniguchi, and X. Liang, "doccano: Text annotation tool for human," 2018. [Online]. Available: https://github.com/doccano/doccano
[11] M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd, "spaCy: Industrial-strength natural language processing in Python," 2020. [Online]. Available: https://doi.org/10.5281/zenodo.1212303
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[13] D. Misra, "Mish: A self regularized non-monotonic activation function," in BMVC, 2020.
[14] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997. [Online]. Available: https://doi.org/10.1162/neco.1997.9.8.1735
[15] D. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, 1986.
[16] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM networks," in Proceedings 2005 IEEE International Joint Conference on Neural Networks, vol. 4, 2005, pp. 2047–2052.
[17] M. Schuster and K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[18] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky, "The Stanford CoreNLP natural language processing toolkit," in ACL (System Demonstrations). The Association for Computer Linguistics, 2014, pp. 55–60. [Online]. Available: http://dblp.uni-trier.de/db/conf/acl/acl2014-d.html#ManningSBFBM14
[19] L. Ramshaw and M. Marcus, "Text Chunking Using Transformation-Based Learning," in Proceedings of the Third ACL Workshop on Very Large Corpora, 1995.

TABLE VI. THE F1-SCORE RESULTS OF DIFFERENT METHODS
Entity: district_name, place_name, transaction_type, property_certificate, property_type, phone, number_street_name, area, distance, province_city, host_name, ward_name, price, direction, front_road, email
MaxoutWindowEncoder W64: 0.8846 0.826 0.9104 0.7204 0.8362 0.988 0.8601 0.9606 0.939 0.8128 0.93 0.6651 0.9567 0.9655 0.9591
LSTM W64: 0.8267 0.7887 0.8272 0.8239 0.9205 0.6303 0.9814 0.9547 0.9365 0.9101 0.7422 0.6265 0.9577 0.9446 0.9402 0.8333
MishWindowEncoder W64: 0.8564 0.8396 0.6539 0.815 0.9285 0.6303 0.9858 0.8276 0.9535 0.9371 0.8072 0.9264 0.9656 0.9651 0.9595 0.9231
BiLSTM W64: 0.825 0.7881 0.8198 0.9101 0.6609 0.9793 0.8109 0.9301 0.9198 0.7195 0.9021 0.4977 0.948 0.9061 0.948 0.9231
MaxoutWindowEncoder W300: 0.8853 0.8456 0.845 0.9311 0.714 0.8306 0.988 0.9582 0.9459 0.8772 0.93 0.5651 0.9704 0.9651 0.9739
LSTM W300: 0.8786 0.8723 0.8171 0.9284 0.6595 0.8346 0.9846 0.9596 0.9497 0.8012 0.9312 0.6439 0.9664 0.9595 0.9489
MishWindowEncoder W300: 0.9256 0.8944 0.8816 0.9367 0.7676 0.8555 0.9923 0.9712 0.961 0.8619 0.9511 0.7441 0.9824 0.9826 0.9829 0.9231
BiLSTM W300: 0.8092 0.7842 0.9291 0.6782 0.8256 0.9792 0.8372 0.949 0.9283 0.9042 0.6817 0.5796 0.9544 0.9459 0.9568 0.7692

TABLE VII. THE PRECISION RESULTS OF DIFFERENT METHODS
Entity: district_name, place_name, transaction_type, property_certificate, property_type, phone, number_street_name, area, distance, province_city, host_name, ward_name, price, direction, front_road, email
MaxoutWindowEncoder W64: 0.8831 0.8197 0.8831 0.7128 0.7758 0.989 0.8745 0.9645 0.9138 0.76 0.9443 0.7514 0.9567 0.9492 0.9762
LSTM W64: 0.7925 0.8227 0.8568 0.7714 0.8922 0.6197 0.9803 0.9547 0.9349 0.904 0.7318 0.7143 0.9669 0.9419 0.9322
MishWindowEncoder W64: 0.8507 0.8505 0.7366 0.8204 0.9353 0.6197 0.9868 0.814 0.9483 0.9396 0.8481 0.9178 0.9636 0.9595 0.9651
BiLSTM W64: 0.8543 0.8197 0.8525 0.9243 0.6567 0.9761 0.8137 0.9332 0.9136 0.7095 0.9211 0.5263 0.9617 0.8586 0.9535
MaxoutWindowEncoder W300: 0.8779 0.7991 0.8718 0.8954 0.6963 0.8068 0.9869 0.9587 0.9302 0.8929 0.9443 0.7969 0.9652 0.9595 0.9825
LSTM W300: 0.8712 0.8632 0.8279 0.8973 0.6538 0.7828 0.989 0.9506 0.9596 0.8155 0.9299 0.7458 0.9792 0.9486 0.9382
MishWindowEncoder W300: 0.9419 0.8728 0.8787 0.9073 0.7634 0.7984 0.9934 0.9756 0.9482 0.8298 0.9563 0.8307 0.9811 0.9769 0.9773
BiLSTM W300: 0.8419 0.765 0.9283 0.6738 0.8154 0.9803 0.8823 0.9627 0.9013 0.903 0.6044 0.5982 0.9603 0.9222 0.9595 0.8333

TABLE VIII. THE RECALL RESULTS OF DIFFERENT METHODS
Entity: district_name, place_name, transaction_type, property_certificate, property_type, phone, number_street_name, area, distance, province_city, host_name, ward_name, price, direction, front_road, email
MaxoutWindowEncoder W64: 0.8861 0.8324 0.9394 0.7283 0.9067 0.9869 0.8462 0.9567 0.9656 0.8736 0.9162 0.5966 0.9567 0.9825 0.9425
LSTM W64: 0.8639 0.7574 0.7996 0.8842 0.9506 0.6413 0.9825 0.9547 0.9381 0.9162 0.7529 0.5579 0.9486 0.9474 0.9483 0.7143
MishWindowEncoder W64: 0.8622 0.8289 0.588 0.8096 0.9219 0.6413 0.9847 0.8417 0.9588 0.9347 0.7701 0.9351 0.9675 0.9708 0.954 0.8571
BiLSTM W64: 0.7976 0.7588 0.7895 0.8963 0.6652 0.9825 0.8082 0.9269 0.9261 0.7299 0.8838 0.4721 0.9346 0.9591 0.9425 0.8571
MaxoutWindowEncoder W300: 0.8929 0.8978 0.8199 0.9697 0.7326 0.8559 0.9891 0.9577 0.9622 0.8621 0.9162 0.4378 0.9756 0.9708 0.9655
LSTM W300: 0.8861 0.8816 0.8066 0.9617 0.6652 0.8936 0.9803 0.9688 0.9399 0.7874 0.9324 0.5665 0.954 0.9708 0.9598
MishWindowEncoder W300: 0.9099 0.917 0.8846 0.9681 0.7717 0.9214 0.9912 0.9668 0.9742 0.8966 0.9459 0.6738 0.9838 0.9883 0.9885 0.8571
BiLSTM W300: 0.7789 0.8044 0.9298 0.6826 0.836 0.9781 0.7966 0.9356 0.957 0.9054 0.7816 0.5622 0.9486 0.9708 0.954 0.7143
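The four string-cleaning steps of Section III-B (Data Processing) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' code: the function name `preprocess`, the noise character set, and the rule that only the money units "tỷ" and "tr" are split from adjacent digits are all hypothetical choices, made so that a token such as 195.35m2 stays intact as in Table I.

```python
import re
import unicodedata

# All names and patterns here are illustrative assumptions, not the paper's code.

# Step 1 (assumed noise set): control whitespace, non-breaking space, common emoji ranges.
NOISE_RE = re.compile("[\n\r\t\u00a0\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Step 2 (assumed rule): split the money units "tỷ" and "tr" from adjacent digits.
# The units are NFC-normalized so they match the NFC-normalized input text.
MONEY_UNITS = tuple(unicodedata.normalize("NFC", u) for u in ("tỷ", "tr"))
GLUED_UNIT_RE = re.compile(r"(\d)\s*(" + "|".join(MONEY_UNITS) + r")(?=\d|\b)")

# Step 3: a dot or comma that is NOT between two digits gets separated from its word.
STRAY_PUNCT_RE = re.compile(r"(?<!\d)[.,]|[.,](?!\d)")


def preprocess(text: str) -> str:
    """Apply the four cleaning steps to one advertisement string."""
    text = unicodedata.normalize("NFC", text)      # Unicode normalization
    text = NOISE_RE.sub(" ", text)                 # 1) remove meaningless characters
    text = GLUED_UNIT_RE.sub(r"\1 \2 ", text)      # 2) split joined words (money units only)
    text = STRAY_PUNCT_RE.sub(r" \g<0> ", text)    # 3) separate dots/commas, keep 12.5 and 1,000,000
    return re.sub(r"\s+", " ", text).strip()       # 4) collapse multiple spaces


if __name__ == "__main__":
    print(preprocess("Giá 1tỷ700tr\nDT 195.35m2,nhà đẹp.LH ngay \U0001F600"))
```

A general "split joined words" rule would need a dictionary or a segmentation model; the unit-based regex above is a deliberately narrow stand-in that covers the price examples in Table I.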

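The BIO labeling scheme of Section V-A and the metrics of Equation (1) in Section V-B can be illustrated together with a small Python sketch. It assumes exact-match, entity-level scoring (a predicted span counts as a true positive only if its boundaries and label both match the gold span); the paper does not state its matching rule, so this is an assumption, and the helper names and the toy sentence are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# (start token index, end token index exclusive, entity label)
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: Iterable[Span]) -> List[str]:
    """Encode entity spans as BIO tags: B- for the first token, I- for the rest, O elsewhere."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def bio_to_spans(tags: Iterable[str]) -> Set[Span]:
    """Decode BIO tags back into spans; stray I- tags without a matching B- are ignored."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # the sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def precision_recall_f1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Equation (1): P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    tp = len(gold & pred)
    fp = len(pred - gold)
    fn = len(gold - pred)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


if __name__ == "__main__":
    # Toy sentence (hypothetical): "bán nhà quận 7 giá 5 tỷ"
    gold = {(0, 1, "transaction_type"), (1, 2, "property_type"),
            (2, 4, "district_name"), (5, 7, "price")}
    pred_tags = ["B-transaction_type", "B-property_type", "B-district_name",
                 "I-district_name", "O", "B-price", "O"]   # price span cut short
    print(spans_to_bio(7, sorted(gold)))
    print(precision_recall_f1(gold, bio_to_spans(pred_tags)))
```

In the toy example the truncated price span counts as both a false positive and a false negative, which is the usual consequence of exact-match scoring.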
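Section V-C attributes part of the MishWindowEncoder result to Mish being bounded below. The sketch below, a plain-Python illustration rather than the spaCy implementation, evaluates Mish as defined by Misra [13], mish(x) = x · tanh(softplus(x)), next to ReLU; the helper names and the evaluation grid are assumptions made for this example.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def relu(x: float) -> float:
    """ReLU activation, shown only for comparison."""
    return max(x, 0.0)


if __name__ == "__main__":
    for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(f"x={x:+.1f}  relu={relu(x):.4f}  mish={mish(x):.4f}")
    # Scan a grid on [-10, 10] to see that Mish never drops far below zero.
    lowest = min(mish(i / 100) for i in range(-1000, 1001))
    print(f"lowest Mish value on the grid: {lowest:.4f}")
```

Unlike ReLU, Mish lets small negative values through (its minimum is about -0.31) and approaches the identity for large positive inputs, which is the "bounded below" property the results discussion refers to.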