Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data 16th China National Conference, CCL 2017 and 5th International Symposium, NLP-NABD 2017 Nanjing, China, October 13–15, 2017, Proceedings 文信息学 国中 会 中 ie t y of C h in a o C h i n e s e In f r m at io n oc LNAI 10565 Maosong Sun · Xiaojie Wang Baobao Chang · Deyi Xiong (Eds.) P ro c e s s i n gS 123 www.ebook3000.com Lecture Notes in Artificial Intelligence Subseries of Lecture Notes in Computer Science LNAI Series Editors Randy Goebel University of Alberta, Edmonton, Canada Yuzuru Tanaka Hokkaido University, Sapporo, Japan Wolfgang Wahlster DFKI and Saarland University, Saarbrücken, Germany LNAI Founding Series Editor Joerg Siekmann DFKI and Saarland University, Saarbrücken, Germany 10565 More information about this series at http://www.springer.com/series/1244 www.ebook3000.com Maosong Sun Xiaojie Wang Baobao Chang Deyi Xiong (Eds.) • • Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data 16th China National Conference, CCL 2017 and 5th International Symposium, NLP-NABD 2017 Nanjing, China, October 13–15, 2017 Proceedings 123 Editors Maosong Sun Tsinghua University Beijing China Baobao Chang Peking University Beijing China Deyi Xiong Soochow University Suzhou China Xiaojie Wang Beijing University of Posts and Telecommunications Beijing China ISSN 0302-9743 ISSN 1611-3349 (electronic) Lecture Notes in Artificial Intelligence ISBN 978-3-319-69004-9 ISBN 978-3-319-69005-6 (eBook) https://doi.org/10.1007/978-3-319-69005-6 Library of Congress Control Number: 2017956073 LNCS Sublibrary: SL7 – Artificial Intelligence © Springer International Publishing AG 2017 This work is subject to copyright All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed The use of general descriptive names, registered names, trademarks, service marks, etc in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations Printed on acid-free paper This Springer imprint is published by Springer Nature The registered company is Springer International Publishing AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland www.ebook3000.com Preface Welcome to the proceedings of the 16th China National Conference on Computational Linguistics (16th CCL) and the 5th International Symposium on Natural Language Processing Based on Naturally Annotated Big Data (5th NLP-NABD) The conference and symposium were hosted by Nanjing Normal University located in Nanjing City, Jiangsu Province, China CCL is an 
annual conference (bi-annual before 2013) that started in 1991 It is the flagship conference of the Chinese Information Processing Society of China (CIPS), which is the largest NLP scholar and expert community in China CCL is a premier nation-wide forum for disseminating new scholarly and technological work in computational linguistics, with a major emphasis on computer processing of the languages in China such as Mandarin, Tibetan, Mongolian, and Uyghur Affiliated with the 16th CCL, the 5th International Symposium on Natural Language Processing Based on Naturally Annotated Big Data (NLP-NABD) covered all the NLP topics, with particular focus on methodologies and techniques relating to naturally annotated big data In contrast to manually annotated data such as treebanks that are constructed for specific NLP tasks, naturally annotated data come into existence through users’ normal activities, such as writing, conversation, and interactions on the Web Although the original purposes of these data typically were unrelated to NLP, they can nonetheless be purposefully exploited by computational linguists to acquire linguistic knowledge For example, punctuation marks in Chinese text can help word boundaries identification, social tags in social media can provide signals for keyword extraction, and categories listed in Wikipedia can benefit text classification The natural annotation can be explicit, as in the aforementioned examples, or implicit, as in Hearst patterns (e.g., “Beijing and other cities” implies “Beijing is a city”) This symposium focuses on numerous research challenges ranging from very-large-scale unsupervised/ semi-supervised machine leaning (deep learning, for instance) of naturally annotated big data to integration of the learned resources and models with existing handcrafted “core” resources and “core” language computing models NLP-NABD 2017 was supported by the National Key Basic Research Program of China (i.e., “973” Program) “Theory and Methods for Cyber-Physical-Human Space Oriented Web Chinese Information Processing” under grant no 2014CB340500 and the Major Project of the National Social Science Foundation of China under grant no 13&ZD190 The Program Committee selected 108 papers (69 Chinese papers and 39 English papers) out of 272 submissions from China, Hong Kong (region), Singapore, and the USA for publication The acceptance rate is 39.7% The 39 English papers cover the following topics: – – – – Fundamental Theory and Methods of Computational Linguistics (6) Machine Translation (2) Knowledge Graph and Information Extraction (9) Language Resource and Evaluation (3) VI – – – – – Preface Information Retrieval and Question Answering (6) Text Classification and Summarization (4) Social Computing and Sentiment Analysis (1) NLP Applications (4) Minority Language Information Processing (4) The final program for the 16th CCL and the 5th NLP-NABD was the result of a great deal of work by many dedicated colleagues We want to thank, first of all, the authors who submitted their papers, and thus contributed to the creation of the high-quality program that allowed us to look forward to an exciting joint conference We are deeply indebted to all the Program Committee members for providing high-quality and insightful reviews under a tight schedule We are extremely grateful to the sponsors of the conference Finally, we extend a special word of thanks to all the colleagues of the Organizing Committee and secretariat for their hard work in organizing the conference, and to Springer for 
their assistance in publishing the proceedings in due time We thank the Program and Organizing Committees for helping to make the conference successful, and we hope all the participants enjoyed a memorable visit to Nanjing, a historical and beautiful city in East China August 2017 Maosong Sun Ting Liu Guodong Zhou Xiaojie Wang Baobao Chang Benjamin K Tsou Ming Li www.ebook3000.com Organization General Chairs Nanning Zheng Guangnan Ni Xi’an Jiaotong University, China Institute of Computing Technology, Chinese Academy of Sciences, China Program Committee 16th CCL Program Committee Chairs Maosong Sun Ting Liu Guodong Zhou Tsinghua University, China Harbin Institute of Technology, China Soochow University, China 16th CCL Program Committee Co-chairs Xiaojie Wang Baobao Chang Beijing University of Posts and Telecommunications, China Peking University, China 16th CCL and 5th NLP-NABD Program Committee Area Chairs Linguistics and Cognitive Science Shiyong Kang Meichun Liu Ludong University, China City University of Hong Kong, SAR China Fundamental Theory and Methods of Computational Linguistics Houfeng Wang Mo Yu Peking University, China IBM T.J Watson, Research Center, USA Information Retrieval and Question Answering Min Zhang Yongfeng Zhang Tsinghua University, China UMass Amherst, USA Text Classification and Summarization Tingting He Changqin Quan Central China Normal University, China Kobe University, Japan VIII Organization Knowledge Graph and Information Extraction Kang Liu William Wang Institute of Automation, Chinese Academy of Sciences, China UC Santa Barbara, USA Machine Translation Tong Xiao Adria De Gispert Northeast University, China University of Cambridge, UK Minority Language Information Processing Aishan Wumaier Haiyinhua Xinjiang University, China Inner Mongolia University, China Language Resource and Evaluation Sujian Li Qin Lu Peking University, China The Hong Kong Polytechnic University, SAR China Social Computing and Sentiment Analysis Suge Wang Xiaodan Zhu Shanxi University, China National Research Council of Canada NLP Applications Ruifeng Xu Yue Zhang Harbin Institute of Technology Shenzhen Graduate School, China Singapore University of Technology and Design, Singapore 16th CCL Technical Committee Members Rangjia Cai Dongfeng Cai Baobao Chang Xiaohe Chen Xueqi Cheng Key-Sun Choi Li Deng Alexander Gelbukh Josef van Genabith Randy Goebel Tingting He Isahara Hitoshi Heyan Huang Xuanjing Huang Donghong Ji Turgen Ibrahim Qinghai Normal University, China Shenyang Aerospace University, China Peking University, China Nanjing Normal University, China Institute of Computing Technology, CAS, China KAIST, Korea Microsoft Research, USA National Polytechnic Institute, Mexico Dublin City University, Ireland University of Alberta, Canada Central China Normal University, China Toyohashi University of Technology, Japan Beijing Polytechnic University, China Fudan University, China Wuhan University, China Xinjiang University, China www.ebook3000.com Organization Shiyong Kang Sadao Kurohashi Kiong Lee Hang Li Ru Li Dekang Lin Qun Liu Shaoming Liu Ting Liu Qin Lu Wolfgang Menzel Jian-Yun Nie Yanqiu Shao Xiaodong Shi Rou Song Jian Su Benjamin Ka Yin Tsou Haifeng Wang Fei Xia Feiyu Xu Nianwen Xue Erhong Yang Tianfang Yao Shiwen Yu Quan Zhang Jun Zhao Guodong Zhou Ming Zhou Jingbo Zhu Ping Xue Ludong University, China Kyoto University, Japan ISO TC37, Korea Huawei, Hong Kong, SAR China Shanxi University, China NATURALI Inc., China Dublin City University, Ireland; Institute of Computing Technology, 
CAS, China; Fuji Xerox, Japan; Harbin Institute of Technology, China; Polytechnic University of Hong Kong, SAR China; University of Hamburg, Germany; University of Montreal, Canada; Beijing Language and Culture University, China; Xiamen University, China; Beijing Language and Culture University, China; Institute for Infocomm Research, Singapore; City University of Hong Kong, SAR China; Baidu, China; University of Washington, USA; DFKI, Germany; Brandeis University, USA; Beijing Language and Culture University, China; Shanghai Jiaotong University, China; Peking University, China; Institute of Acoustics, CAS, China; Institute of Automation, CAS, China; Soochow University, China; Microsoft Research Asia, China; Northeast University, China; Research & Technology, The Boeing Company, USA

5th NLP-NABD Program Committee Chairs

Maosong Sun, Tsinghua University, China
Benjamin K. Tsou, City University of Hong Kong, SAR China
Ming Li, University of Waterloo, Canada

5th NLP-NABD Technical Committee Members

Key-Sun Choi, KAIST, Korea
Li Deng, Microsoft Research, USA
Alexander Gelbukh, National Polytechnic Institute, Mexico
Josef van Genabith, Dublin City University, Ireland
Randy Goebel, University of Alberta, Canada

(ni) (qdq) (ehileged) (bi) (harigvcahv) (bqlvn_a)”. For example, in this sentence the word “ ” (“bqdqgsan”, “bvdvgsan”) and the word “ ” (“vdv”, “vtv”, “qdq”, “qtq”) are polyphones. As illustrated in Fig. 5, the Latin-transliteration form is annotated below each Mongolian word. The word “ ” (meaning: think, paint) corresponds to two Latin forms, and the word “ ” (meaning: omen, smoke, now, estrus) corresponds to four correct spellings. The correct sentence is denoted by the path drawn with the bolder line, i.e., “minu bqdqgsan ni qdq ehileged bi harigvcahv bqlvn_a.”

The n-gram language model [13] is widely used in statistical language modeling. The probability of a Mongolian word sequence w = w_1 w_2 ... w_m can be written as a product of conditional probabilities:

p(w) = p(w_1 w_2 \cdots w_m) = \prod_{i=1}^{m} p(w_i \mid w_1^{i-1}) \approx \prod_{i=1}^{m} p(w_i \mid w_{i-n+1}^{i-1})    (1)

The probability of the m-th word w_m depends on all the preceding words w_1 w_2 ... w_{m-1}. We can use this model to estimate the probability of a sentence in the corpus by making a simple independence assumption based on the Markov assumption [14]: under an n-gram language model, the current word is only related to the previous n-1 words. From Eq. (1) we can see that the goal of the language model is to estimate the conditional probability of the next word, p(w_i | w_{i-n+1}^{i-1}). The most commonly used estimation method is maximum likelihood estimation (MLE):

p(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i})}{c(w_{i-n+1}^{i-1})}    (2)

where c(\cdot) denotes the count of the corresponding n-gram in the corpus. However, a drawback of MLE is that any n-gram that does not appear in the training set is given zero probability. Smoothing algorithms can be used to solve this zero-probability problem; in this paper, we use the Kneser-Ney smoothing algorithm [15].
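To make Eqs. (1) and (2) concrete, the sketch below estimates bigram probabilities by MLE from raw counts and scores a sentence with them. It is only a toy illustration of the formulas, not the system described in the paper (which trains its models with the SRILM toolkit and Kneser-Ney smoothing); the tiny corpus and all identifiers are assumptions made for the example.

```python
from collections import Counter

# Toy corpus of tokenized sentences (hypothetical Latin-transliterated Mongolian).
corpus = [
    ["minu", "bqdqgsan", "ni", "qdq", "ehileged"],
    ["bi", "harigvcahv", "bqlvn_a"],
]

# Count unigrams and bigrams, padding each sentence with <s> and </s>.
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    padded = ["<s>"] + sent + ["</s>"]
    unigrams.update(padded)
    bigrams.update(zip(padded, padded[1:]))

def mle_bigram_prob(prev, word):
    """Eq. (2) for n = 2: p(w_i | w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1})."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_prob(sent):
    """Eq. (1) for n = 2: product of bigram probabilities over the sentence."""
    padded = ["<s>"] + sent + ["</s>"]
    p = 1.0
    for prev, word in zip(padded, padded[1:]):
        # An unseen bigram makes the whole product zero, which is exactly
        # the MLE weakness that Kneser-Ney smoothing addresses.
        p *= mle_bigram_prob(prev, word)
    return p

print(sentence_prob(["minu", "bqdqgsan", "ni", "qdq", "ehileged"]))  # 0.5 on this toy corpus
```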
5 Experiment

The principal contribution of this paper is twofold: (1) we built our own resource library, including dictionaries containing all polyphones and the datasets used as training and test corpora; (2) we apply the language-model-based method to handle polyphone errors. In this section, we describe how the resources were created and present the experimental evaluation and analysis.

5.1 Data Resource

In general, the number of Mongolian linguistic resources that are publicly and freely available for research purposes is limited. Therefore, we had to spend considerable effort acquiring, annotating, and verifying our own linguistic resources in order to develop the proofreading system properly.

The proposed statistical approach relies on pre-defined confusion sets, which are composed of commonly confounded words such as the polyphone set {“qdq”, “qtq”, “vtv”, “vdv”}, and on a good-quality dataset for training and testing. After a period of collecting and collating, we finished creating the confusion sets: 252 verbal stems were put into the verbal stem dictionary and 998 whole words into the nominal stem dictionary. Concatenated with verbal suffixes and case suffixes, the verbal stems can derive about 22,971 tokens, and the 998 whole words can derive about 19,407 tokens when concatenated with case suffixes.

Since textual resources on the Internet are full of coding errors, the dataset used for the training and test sets was constructed in three steps: (1) original Mongolian texts of about 50,000 sentences written in the national standard code were obtained from Mongolian news websites; (2) the texts were corrected preliminarily by the automatic proofreading system without the polyphone correction module (for each polyphone, one candidate was selected at random), and the sentences containing polyphones were then picked out; (3) manual annotation was carried out on the selected sentences using the open-source platform BRAT [16]. The annotation took about one and a half months with four native Mongolian speakers.

The collated Mongolian corpus, every sentence of which contains polyphones, consists of 41,416 sentences and 2,822,337 words. It was split into training data of 38,416 sentences and test data of 3,000 sentences.

5.2 N-gram Language Model Based Approach

We take the Correction Accuracy Rate (CAR) as the evaluation metric, defined as

CAR = \frac{N_{correct}}{N_{total}}    (3)

where N_{correct} denotes the number of polyphones that are correctly proofread and N_{total} is the total number of polyphones that need to be corrected.

We build the n-gram language models with the SRILM toolkit [15] using Kneser-Ney discounting. The calibration process can be divided into two steps. First, all Mongolian words are corrected one by one according to the rule-based approach. Then, we check whether each sentence contains a polyphone; if it does, we take the sentence as the basic unit and determine the best candidate according to the language model.
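The second step of this procedure, choosing among the possible spellings of a polyphone sentence by language-model score, can be sketched as follows. The confusion groups, the toy bigram scores, and the function names here are hypothetical placeholders; in the actual system the candidates would be scored with the SRILM-trained models described above.

```python
import itertools
import math

# Hypothetical confusion groups: spellings that are easily confounded with one another.
CONFUSION_GROUPS = [
    ["qdq", "qtq", "vdv", "vtv"],
    ["bqdqgsan", "bvdvgsan"],
]
# Map every member to its full group so any observed spelling can be expanded.
CANDIDATES = {w: group for group in CONFUSION_GROUPS for w in group}

# Stand-in bigram log-probabilities; a real run would query the trained LM instead.
BIGRAM_LOGPROB = {
    ("minu", "bqdqgsan"): -0.7,
    ("bqdqgsan", "ni"): -0.5,
    ("ni", "qdq"): -0.9,
}
UNSEEN = -10.0  # crude back-off score for bigrams the toy table does not cover

def sentence_score(tokens):
    """Sum of bigram log-probabilities, i.e. the log of Eq. (1) for n = 2."""
    return sum(BIGRAM_LOGPROB.get(pair, UNSEEN) for pair in zip(tokens, tokens[1:]))

def best_candidate(tokens):
    """Enumerate every combination of candidate spellings; keep the best-scoring sentence."""
    options = [CANDIDATES.get(tok, [tok]) for tok in tokens]
    best, best_score = None, -math.inf
    for cand in itertools.product(*options):
        score = sentence_score(cand)
        if score > best_score:
            best, best_score = list(cand), score
    return best

print(best_candidate(["minu", "bvdvgsan", "ni", "qtq"]))
# -> ['minu', 'bqdqgsan', 'ni', 'qdq'] under the toy scores above
```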
To improve CAR, we evaluate unigram, bigram, and trigram models respectively. As the results in Fig. 6 show, the trigram model performs best, with an accuracy rate of 95.36%, which is 62% higher than the accuracy of the polyphones in the original text without correction. Both the bigram and trigram models outperform the unigram model. The results show that polyphone proofreading performance is effectively improved when contextual information is utilized in the process. Because of data sparseness, the trigram model does not show a significant improvement over the bigram model, with a gain of only 0.06%; the experiment would yield better results if the experimental dataset were more adequate. We also test the overall system performance, as illustrated in Fig. 7: when the trigram model is applied to polyphone proofreading, the overall system performance improves by 16.1%.

Fig. 6. Performance comparison between the rule-based and LM-based approaches (CAR for the original polyphone accuracy and for the unigram, bigram, and trigram models).

Fig. 7. Overall system performance comparison (CAR for the rule-based system vs. the rule-based system plus the trigram model).

6 Conclusion

In this paper, we present a statistical language model based approach after describing the MAPS framework, and we introduce in detail the construction of the resource library. Our purpose is the development of a high-quality correction module for polyphonic words, which is one of the real-word correction problems. From the experimental results, the n-gram language model proved to be an effective approach to polyphone correction, improving the overall performance of the automatic proofreading system by 16.1%. In future work, we plan to expand our training sets and try other methods to detect and correct polyphones. Moreover, we will extend our method to other kinds of real-word errors such as semantic errors, malapropisms, structural errors, and pragmatic errors.

Acknowledgements. This paper is supported by the National Natural Science Foundation of China (No. 61563040), the Inner Mongolia Natural Science Foundation major project (No. 2016ZD06), and the Inner Mongolia Natural Science Fund project (No. 2017BS0601).

References

1. Wang, W., Bao, F., Gao, G.: Mongolian named entity recognition system with rich features. In: COLING, pp. 505–512 (2016)
2. Bao, F., Gao, G., Wang, H., et al.: Cyril Mongolian to traditional Mongolian conversion based on rules and statistics method. J. Chin. Inf. Process. 31(3), 156–162 (2013)
3. Bao, F., Gao, G., Yan, X., et al.: Segmentation-based Mongolian LVCSR approach. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8136–8139. IEEE (2013)
4. Islam, A., Inkpen, D.: Real-word spelling correction using Google Web 1T n-gram data set. In: International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE), pp. 1689–1692. IEEE (2009)
5. Su, C., Hou, H., Yang, P., Yuan, H.: Based on the statistical translation framework of the Mongolian automatic spelling correction method. J. Chin. Inf. Process. 175–179 (2013)
6. Si, L.: Mongolian proofreading algorithm based on nondeterministic finite automata. Chin. J. Inf. 23(6), 110–115 (2009)
7. Jiang, B.: Research on the Rule-Based Method of Mongolian Automatic Correction. Inner Mongolia University, Hohhot (2014)
8. Yan, X., Bao, F., Wei, H., Su, X.: A novel approach to improve the Mongolian language model using intermediate characters. In: Sun, M., Huang, X., Lin, H., Liu, Z., Liu, Y. (eds.)
CCL/NLP-NABD 2016. LNCS, vol. 10035, pp. 103–113. Springer, Cham (2016). doi:10.1007/978-3-319-47674-2_9
9. Gong, Z.: Research on Mongolian code conversion. Inner Mongolia University (2008)
10. GB 25914-2010: Information technology of traditional Mongolian nominal characters, presentation characters and control characters using the rules (2011)
11. Surgereltu: Mongolia Orthography Dictionary, 5th edn. Inner Mongolia People's Publisher, Hohhot (2011)
12. Inner Mongolia University: Modern Mongolian, 2nd edn. Inner Mongolia People's Publisher, Hohhot (2005)
13. Zong, C.: Statistical Natural Language Processing, 2nd edn. Tsinghua University Press, Beijing (2008)
14. Jurafsky, D., Martin, J.: Speech and Language Processing, 2nd edn. Prentice Hall, Upper Saddle River (2009)
15. Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Proceedings of the International Conference on Spoken Language Processing, Denver, Colorado (2002)
16. Pontus, S., Sampo, P., Goran, T.: BRAT: a web-based tool for NLP-assisted text annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 102–107

End-to-End Neural Text Classification for Tibetan

Nuo Qun, Xing Li, Xipeng Qiu (B), and Xuanjing Huang

School of Information Science and Technology, Tibet University, No. 10 Zangda, Tibet, China
School of Computer Science, Fudan University, 825 Zhangheng Road, Shanghai, China
xpqiu@fudan.edu.cn

Abstract. As a minority language, Tibetan has received relatively little attention in the field of natural language processing (NLP), especially in current neural network models. In this paper, we investigate three end-to-end neural models for Tibetan text classification. The experimental results show that the end-to-end models outperform traditional Tibetan text classification methods. The dataset and code are available at https://github.com/FudanNLP/Tibetan-Classification.

1 Introduction

Although some efforts have been made in Tibetan natural language processing (NLP), it still lags behind research on resource-rich and widely used languages. Since Tibetan is a resource-poor language and lacks a large-scale corpus, it is hard to build state-of-the-art machine-learning-based NLP systems; for example, Tibetan word segmentation technology is still not well developed.

Recently, deep learning approaches have achieved great success in many NLP tasks by adopting various neural networks to model natural language, such as neural bag-of-words (NBOW), recurrent neural networks (RNNs) [2,17], recursive neural networks (RecNNs) [16], and convolutional neural networks (CNNs) [3,11]. Different from traditional NLP methods, neural models take distributed representations (dense vectors) of the words in a text as input and generate a fixed-length vector as the representation of the whole text. A good representation of a variable-length text should fully capture the semantics of natural language. These neural models can alleviate the burden of handcrafted feature engineering and allow researchers to build end-to-end NLP systems without the need for external NLP tools such as word segmenters and parsers. Therefore, deep learning provides a great opportunity for Tibetan NLP as well as for other low-resource languages.

In this paper, we investigate several end-to-end neural models for Tibetan NLP. Specifically, we choose Tibetan text classification due to its popularity and wide applications.
© Springer International Publishing AG 2017
M. Sun et al. (Eds.): CCL 2017 and NLP-NABD 2017, LNAI 10565, pp. 472–480, 2017.
https://doi.org/10.1007/978-3-319-69005-6_39

Since there is no explicit segmentation between Tibetan words and the word vocabulary is very large, we directly model Tibetan text at the syllable and letter (character) levels without any explicit word segmentation. In detail, we investigate three popular neural models: NBOW, RNN, and CNN. Our contributions can be summarized as follows:

– This is the first work to use end-to-end neural network methods for Tibetan text classification. Experiments show that our proposed models are effective and do not rely on external NLP tools.
– We also construct a corpus for Tibetan text classification and make it available to anyone who needs it.

Fig. 1. Each syllable is converted to a multi-dimensional vector xi. All these vectors are fed into a neural network model to produce a vector z representing the text. Then a linear classifier with a softmax function computes the probabilities of each class.

2 The Proposed Framework

As shown in Fig. 1, our proposed framework consists of three layers: (1) the embedding layer maps each syllable or letter in the text to a dense vector; (2) the encoding layer represents the text with a fixed-length vector; and (3) the output layer predicts the class label.

2.1 Embedding Layer

In the Tibetan script, many Tibetan words are multisyllabic, consisting of several syllables. Syllables are separated by a tsheg, which often functions almost as a space but is not used to divide words. The Tibetan alphabet has 30 basic letters for consonants and 4 letters for vowels. Each consonant letter assumes an inherent vowel a; the vowels i, e, and o are placed above consonants as diacritics, while the vowel u is placed underneath consonants. Figure 2 shows an example of the structure of a Tibetan word.

Fig. 2. Structure of a Tibetan word (programming).

Neural NLP models usually take distributed representations of words as input; however, this is difficult for Tibetan for two major reasons. One is that there is no delimiter to mark the boundary between two words, and Tibetan word segmentation technology is still not well developed. The other is that the Tibetan vocabulary is very large and usually contains millions of words, so the representations of rare and complex words are poorly estimated. Here, we obtain a distributed representation for each syllable by using a lookup table; similarly, there is some work on English and Chinese that models text at the character or morpheme level [13]. Given a Tibetan syllable sequence x = {x_1, x_2, ..., x_T}, we first use a lookup layer to get the vector representation (embedding) x_i of each syllable x_i.

2.2 Encoding Layer

The encoding layer converts the embedding sequence of syllables into a vectorial representation z with different neural models and then feeds this representation to the output layer. A good representation should fully capture the semantics of natural language; the role of this layer is to capture the interaction among the syllables in the text.

Neural Bag-of-Words. A simple and intuitive method is the Neural Bag-of-Words (NBOW) model, in which the representation of a text is generated by averaging its constituent word representations. The main drawback of NBOW is that the word order is lost; although NBOW is effective for general document classification, it is not suitable for short sentences. Here, we adopt a simplified version of Deep Averaging Networks (DAN) [7]; the difference is that all non-linear hidden layers are removed.
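As a concrete illustration of the embedding and NBOW encoding layers, the sketch below looks up syllable vectors in an embedding table and averages them into the text representation z. It is a minimal NumPy sketch under assumed values (a toy vocabulary; 500-dimensional embeddings as in the experimental setup below), not the authors' implementation.

```python
import numpy as np

# Hypothetical syllable vocabulary; in the paper it would be built from the
# 52,131 distinct syllables of the TNCC corpus.
vocab = {"<unk>": 0, "syll_a": 1, "syll_b": 2, "syll_c": 3}
embed_dim = 500  # 500-dimensional syllable embeddings, as in the paper

rng = np.random.default_rng(0)
embedding_table = rng.normal(scale=0.1, size=(len(vocab), embed_dim))

def encode_nbow(syllables):
    """Embedding layer: look up each syllable vector x_i.
    Encoding layer (NBOW): average the vectors into a fixed-length text vector z."""
    ids = [vocab.get(s, vocab["<unk>"]) for s in syllables]
    x = embedding_table[ids]   # shape: (T, embed_dim)
    return x.mean(axis=0)      # shape: (embed_dim,)

z = encode_nbow(["syll_a", "syll_b", "syll_c", "syll_b"])
print(z.shape)  # (500,)
```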
Recurrent Neural Network. Sequence models construct the representation of a sentence based on the recurrent neural network (RNN) [15] or gated versions of the RNN [2,17]. Sequence models are sensitive to word order, but they have a bias towards the latest input words. Here, we adopt the long short-term memory network (LSTM) [5] to model text, which specifically addresses the issue of learning long-term dependencies in RNNs. The LSTM maintains a separate memory cell inside it that updates and exposes its content only when deemed necessary.

Convolutional Models. The convolutional neural network (CNN) is also used to model sentences [3,6,10]. It takes as input the embeddings of the words in the sentence aligned sequentially and summarizes the meaning of the sentence through layers of convolution and pooling, until reaching a fixed-length vectorial representation in the final layer. A CNN can maintain word-order information and learn more abstract features. Here, we adopt the CNN model used in [11].

2.3 Output Layer

After obtaining the text encoding z, we feed it to a fully connected layer followed by a softmax non-linear layer that predicts the probability distribution over classes:

\hat{y} = \mathrm{softmax}(W z + b)    (1)

where \hat{y} is the vector of predicted probabilities, W is the weight matrix that needs to be learned, and b is a bias term.

Given a corpus with N training samples (x_i, y_i), the parameters of the network are trained to minimise the cross-entropy between the predicted and true distributions:

L(\hat{y}, y) = - \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log(\hat{y}_{ij})    (2)

where y_{ij} is the ground-truth label of x_i, \hat{y}_{ij} is the predicted probability, and C is the number of classes.
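Equations (1) and (2) can be sketched as follows for a single sample. This is a minimal NumPy illustration with assumed shapes (a 500-dimensional encoding and twelve classes, matching the setup below), not the authors' training code, which would normally be written in an automatic-differentiation framework.

```python
import numpy as np

num_classes = 12   # the twelve TNCC classes
encode_dim = 500   # dimensionality of the text encoding z

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(num_classes, encode_dim))  # weights to be learned
b = np.zeros(num_classes)                                   # bias term

def softmax(logits):
    e = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return e / e.sum()

def predict(z):
    """Eq. (1): y_hat = softmax(W z + b)."""
    return softmax(W @ z + b)

def cross_entropy(y_hat, y_onehot):
    """Per-sample term of Eq. (2): -sum_j y_ij * log(y_hat_ij)."""
    return -np.sum(y_onehot * np.log(y_hat + 1e-12))

# Toy usage with a random encoding and a gold label of class 3.
z = rng.normal(size=encode_dim)
y = np.zeros(num_classes)
y[3] = 1.0
print(cross_entropy(predict(z), y))
```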
3 Experiments

In this section, we present our experimental results and perform some analyses to better understand our models.

3.1 Dataset

Although several pioneering papers [9,12] address Tibetan in many natural language tasks, there is no publicly available dataset for Tibetan text classification (although [12] built a large-scale Tibetan text corpus, they did not release it). Hence we create the Tibetan News Classification Corpus (TNCC). This dataset is collected from the China Tibet Online website (http://tb.tibet.cn), which has the most abundant and official Tibetan articles; the articles are classified manually under twenty classes. We pick out the twelve largest and most discriminative classes, in which some articles still have inherent ambiguity. To evaluate the ability to deal with both short and long Tibetan text, we construct two text classification datasets: one for news title classification and one for news document classification. The detailed statistics are shown in Table 1. There are 52,131 distinct syllables in the dataset; on average, each document contains 689 syllables and each title contains 16 syllables. The corpus is split into a training set, a development set, and a test set: the training set makes up 80% of the dataset, and the development set and test set each take 10% of it.

Table 1. Dataset statistics

Classes       Documents  Titles
Politics      2132       2117
Economics     986        983
Education     1370       1359
Tourism       512        510
Environment   953        945
Language      255        244
Literature    259        258
Religion      670        665
Arts          502        492
Medicine      520        519
Customs       275        272
Instruments   842        840
Total         9276       9204

3.2 Experimental Setup

In all models, the syllable embedding size, text encoding size, learning rate, and decay rate are the same. We choose 500-dimensional vectors to represent both syllables and texts; other parameters are initialized randomly. In the CNN model, we use three convolutional layers in the encoding layer. The Adagrad optimizer [4] is used with a decay rate of 0.93 and initial learning rates of 0.5, 1.0, 1.5, and 2.0 to match the different models. To improve performance, we use word2vec [14] to pre-train the embeddings of Tibetan syllables on the Tibetan Wikipedia corpus (https://bo.wikipedia.org).

3.3 Results

We conduct two experiments on our corpus: one is news title classification and the other is news document classification.

Compared Models. To evaluate the effectiveness of the proposed models, we compare them with several baseline models, such as the naive Bayes classifier (NB) and the support vector machine (SVM), whose inputs are embeddings trained by word2vec. Besides syllables, we also investigate the performance of using Tibetan letters as the input of the neural models.

News Title Classification. The results of news title classification are shown in Table 2. We can see that the end-to-end models consistently outperform the other methods, and LSTM achieves better performance than CNN and NBOW. The detailed results are shown in Table 3.

Table 2. Performances on title classification

Model                  Acc    Prec   Rec    F1
word2vec + GaussianNB  28.88  27.33  25.78  22.77
word2vec + SVM         46.84  45.70  32.00  32.19
CNN (syllable)         54.42  49.22  48.34  48.64
CNN (letter)           47.97  39.57  38.63  38.03
LSTM (syllable)        62.65  58.33  56.43  56.65
LSTM (letter)          59.74  59.57  56.06  57.44
NBOW (syllable)        61.56  60.35  55.52  56.99
NBOW (letter)          43.02  42.20  33.18  33.96

Table 3. Detailed results of the LSTM model on title classification

Class        Prec   Rec    F1
Politics     65.63  68.61  67.09
Economics    66.97  41.95  51.59
Education    57.87  70.47  63.55
Tourism      55.45  65.59  60.10
Environment  60.78  72.09  65.95
Language     70.37  54.29  61.29
Literature   27.78  15.15  19.61
Religion     70.51  56.12  62.50
Arts         56.72  49.35  52.78
Medicine     66.23  73.91  69.86
Customs      23.68  25.71  24.65
Instruments  78.01  83.97  80.88

News Document Classification. The results of news document classification are shown in Table 4. The end-to-end models consistently outperform the other methods, and NBOW achieves better performance than CNN and LSTM; its detailed results are shown in Table 5. The reason is that the documents are long, and CNN and LSTM suffer in efficiency as a result.

Table 4. Performances on document classification

Model                   Acc    Prec   Rec    F1
Onehot + MultinomialNB  59.72  67.18  55.17  53.65
word2vec + GaussianNB   52.77  54.24  52.22  54.97
Onehot + SVM            63.52  61.83  61.17  60.85
word2vec + SVM          69.71  67.75  67.45  67.59
CNN (syllable)          61.51  59.39  56.65  57.34
LSTM (syllable)         54.79  52.63  48.62  49.59
NBOW (syllable)         74.02  75.56  71.38  72.40
NBOW (letter)           57.93  49.34  45.45  46.08

Table 5. Detailed results of the NBOW model on document classification

Class        Prec    Rec     F1
Politics     73.16   78.09   75.54
Economics    64.29   72.00   67.93
Education    75.00   69.44   72.11
Tourism      77.08   69.81   73.27
Environment  75.00   68.00   71.33
Language     72.73   50.00   59.26
Literature   100.00  53.85   70.00
Religion     62.34   84.21   71.64
Arts         58.54   57.14   57.83
Medicine     89.36   77.78   83.17
Customs      59.26   76.19   66.67
Instruments  100.00  100.00  100.00

Related Work

Recently, Tibetan text classification has become popular because of its wide applications. In the past years, several rule-based or machine-learning-based methods have been adopted to improve the performance of Tibetan text classification [1,8,9]. These methods used word-based features, such as the vector space model (VSM), to represent texts.
[9] used distributed representations of Tibetan words as features to improve the performance of Tibetan text classification However, these methods are based on Tibetan words Since the fundamental NLP tools, such as Tibetan word segmentation and part-of-speech tagging, are still undeveloped for Tibetan information processing, these methods are limited Conclusion In this paper, we investigate several end-to-end neural models for Tibetan NLP Specifically, we choose Tibetan text classification due to its popularity and wide www.ebook3000.com End-to-End Neural Text Classification for Tibetan 479 applications Since there is no explicit segmentation between Tibetan words and the word vocabulary is also very large, we directly model Tibetan text in syllable and letter (character) levels without any explicit word segmentation Acknowledgments We would like to thank the anonymous reviewers for their valuable comments This work was partially funded by “Everest Scholars” project of Tibet University, National Natural Science Foundation of China (No 61262086), Autonomous Science and Technology Major Project of the Tibet Autonomous Region Science and Technology References Cao, H., Jia, H.: Tibetan text classification based on the feature of position weight In: International Conference on Asian Language Processing (IALP), pp 220–223 IEEE (2013) Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv:1412.3555 (2014) Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch J Mach Learn Res 12, 2493– 2537 (2011) Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization J Mach Learn Res 12, 2121–2159 (2011) Hochreiter, S., Schmidhuber, J.: Long short-term memory Neural Comput 9(8), 1735–1780 (1997) Hu, B., Lu, Z., Li, H., Chen, Q.: Convolutional neural network architectures for matching natural language sentences In: Advances in Neural Information Processing Systems (2014) Iyyer, M., Manjunatha, V., Boyd-Graber, J., Iii, H.D.: Deep unordered composition rivals syntactic methods for text classification In: Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing, pp 1681–1691 (2015) Jiang, T., Yu, H.: A novel feature selection based on Tibetan grammar for Tibetan text classification In: 2015 6th IEEE International Conference on Software Engineering and Service Science (ICSESS), pp 445–448 IEEE (2015) Jiang, T., Yu, H., Zhang, B.: Tibetan text classification using distributed representations of words In: International Conference on Asian Language Processing (IALP), pp 123–126 IEEE (2015) 10 Kalchbrenner, N., Grefenstette, E., Blunsom, P.: A convolutional neural network for modelling sentences In: Proceedings of ACL (2014) 11 Kim, Y.: Convolutional neural networks for sentence classification arXiv preprint arXiv:1408.5882 (2014) 12 Liu, H., Nuo, M., Wu, J., He, Y.: Building large scale text corpus for Tibetan natural language processing by extracting text from web In: 24th International Conference on Computational Linguistics, p 11 Citeseer (2012) 13 Luong, M.T., Socher, R., Manning, C.: Better word representations with recursive neural networks for morphology In: CoNLL-2013, vol 104 (2013) 14 Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space Computer Science (2013) 480 N Qun et 
al 15 Mikolov, T., Karafi´ at, M., Burget, L., Cernock` y, J., Khudanpur, S.: Recurrent neural network based language model In: INTERSPEECH (2010) 16 Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C.: Recursive deep models for semantic compositionality over a sentiment treebank In: Proceedings of EMNLP (2013) 17 Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks In: Advances in Neural Information Processing Systems, pp 3104–3112 (2014) www.ebook3000.com Author Index Akef, Alaa Mamdouh Bao, Feilong 461 Bao, Hongyun 135 Bao, Xinqi 306 Bi, Sheng 147 Cao, Yixin 172 Che, Wanxiang 60 Chen, Wei 110 Chen, Yubo 122 Chen, Yusen 224 Chen, Zhehuai 398 Chi, Junqi 343 Dai, Yuan 184 Deng, Juan 224 Dong, Chuanhai 197 Feng, Chong 159 Feng, Wenhe 424 Gao, Guanglai 461 Gao, Yang 355 Gong, Jing 159 Guo, Jiang 60 Guo, Maosheng 263 Guo, Yidi 355 Han, Yaqian 73 He, Shizhu 273 He, Tianxing 398 Hou, Lei 172 Huang, Degen 387 Huang, Heyan 159, 355, 439 Huang, Kaiyu 387 Huang, Xuanjing 472 Huang, Zuying 343 Ji, Donghong 224 Li, Fang 37 Li, Juanzi 172 Li, Lei 343 Li, Qiang 73 Li, Ruoyu 211 Li, Shoushan 13 Li, Sujian 251 Li, Tuya 48 Li, Wei 287 Li, Xia 424 Li, Xiao 449 Li, Xing 472 Li, Yimeng 237 Ling, Zhen-Hua 295 Liu, Cao 273 Liu, Feng 147 Liu, Jiahao 333 Liu, Jinshuo 224 Liu, Kang 122, 273 Liu, Maofu 424 Liu, Shulin 122 Liu, Ting 60, 263 Liu, Yang 411 Liu, Yong 159 Liu, Yujian 439 Liu, Zhe 97 Liu, Zhuang 97, 387 Long, Congjun 439 Lu, Bingbing 371 Lu, Chi 355 Lu, Min 461 Luo, Wei 122 Luo, Zhunchen 122 Ma, Zhiqiang 48 Mao, Yuzhao 321 Men, Yahui 97 Mi, ChengGang 449 Pan, Jeff 224 Qi, Guilin 147 Qi, Zhenyu 135 Qian, Yanmin 398 Qin, Bing 333 Qiu, Xipeng 472 Qun, Nuo 472 482 Author Index Ren, Han 424 Ren, Yafeng 424 Shao, Yanqiu 237 Shi, Ge 159 Shi, Shumin 439 Sun, Chengjie 333 Sun, Maosong 211 Wang, Chunqi 110 Wang, Houfeng 251 Wang, Lei 449 Wang, Limin 13 Wang, Run-Ze 295 Wang, Tianhang 439 Wang, Xiaojie 37, 321 Wang, Yingying Wang, Yining 85 Wu, Huijia 197 Wu, Songze 371 Wu, Tianxing 147 Wu, Wei 251 Wu, Yue 398 Wu, Yunfang 287, 306 Xiao, Tong 73 Xie, Zhipeng 24 Xu, Bo 110, 135 Xu, Chengcheng 371 Xu, Jiaming 135 Xu, Kang 147 Yan, Qian 13 Yang, Erhong Yang, Hang 273 Yang, Hongkai 237 Yang, Shuangtao 48 Yang, YaTing 449 Yang, Yunlong 97 Yi, Xiaoyuan 211 Yu, Kai 398 Zeng, Daojian 184 Zeng, Junxin 184 Zhan, Chen-Di 295 Zhang, Huaping 371 Zhang, Jiajun 85, 197, 411 Zhang, Jing 172, 387 Zhang, Li 48 Zhang, Yazhao 343 Zhang, Yu 263 Zhao, Dezhi 263 Zhao, Jun 122, 273 Zhao, Yang 85 Zheng, Bo 60 Zheng, Hai-Tao 172 Zheng, Suncong 135 Zhou, Chang 321 Zhou, Guodong 13 Zhou, Huiwei 97 Zhou, Peng 135 Zhu, Jingbo 73 Zhu, ShaoLin 449 Zong, Chengqing 85, 197, 411 www.ebook3000.com ... Xiong (Eds.) • • Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data 16th China National Conference, CCL 2017 and 5th International Symposium,... as Mandarin, Tibetan, Mongolian, and Uyghur Affiliated with the 16th CCL, the 5th International Symposium on Natural Language Processing Based on Naturally Annotated Big Data (NLP-NABD) covered... focus on methodologies and techniques relating to naturally annotated big data In contrast to manually annotated data such as treebanks that are constructed for specific NLP tasks, naturally annotated