(Master's thesis) Building a natural language inference model for Vietnamese using “Truyện Kiều” data

MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION

MASTER'S THESIS
TRẦN THỊ MAI TRANG

BUILDING A NATURAL LANGUAGE INFERENCE MODEL FOR VIETNAMESE USING “TRUYỆN KIỀU” DATA

MAJOR: COMPUTER SCIENCE – 8480101
SKC007968
Ho Chi Minh City, April 2023

CURRICULUM VITAE

I. PERSONAL DETAILS
Full name: Trần Thị Mai Trang
Gender: Female
Date of birth: 10/09/1995
Place of birth: Phú Yên
Hometown: Thái Bình
Ethnicity: Kinh
Home or contact address: 158/7/9A, Street No. 11, Group 10, Quarter 3, Linh Xuân Ward, Thủ Đức District, Ho Chi Minh City
Office phone:
Phone: 0357 622 910
Fax:
E-mail: maitrangtranthi1995@gmail.com

II. EDUCATION
Undergraduate:
Mode of study: full-time
Period: 09/2014 to 06/2018
Institution (university, city): Ho Chi Minh City University of Education, Ho Chi Minh City
Major: Informatics Teacher Education
Graduation project: Building a cosmetics sales website
Date and place of the graduation defense: 13/12/2017, Ho Chi Minh City University of Education
Supervisor: ThS Nguyễn Quang Tấn

III. PROFESSIONAL EXPERIENCE SINCE GRADUATING FROM UNIVERSITY
Period: 8/2018 to present
Workplace: Đào Sơn Tây High School, Ho Chi Minh City Department of Education and Training
Position: Teacher

... BertDataGenerator to map the premise and hypothesis sentences into tokens, which serve as input to the model. For training, the thesis sets the batch size to 8, the learning rate to 10^-4, and the number of epochs to 32. If the model's loss on the validation set fails to improve after a set number of epochs, training stops. The model shuffles the input data during training.
- After training is complete, the thesis tests the model on the test set. As in training, the test data is passed to the BertDataGenerator function to be mapped into tokens, and the model predicts a label from those tokens. The thesis selects the label with the highest probability.

Remarks:

Figure 5.4: Accuracy and loss of the KieuNLI_PhoBERT model

Figure 5.4 shows the accuracy and loss of the KieuNLI_PhoBERT model during training. The accuracy curve is fairly smooth; compared with the KieuNLI model, its values fluctuate little. Validation accuracy during training tends to rise to about 75% and then saturates. The training and validation loss curves are also smoother than those of the KieuNLI model and tend to fall from 1.0 to 0.6. The loss is still high, but it is much lower than that of the KieuNLI model.

Figure 5.5: Confusion matrix of the KieuNLI_PhoBERT model on the test set

Figure 5.5 visualizes the model's results on the test set: all three labels are predicted correctly more than 50% of the time. Specifically, the KieuNLI_PhoBERT model correctly predicts 75% of the sentence pairs labeled Entailment, 61% of those labeled Contradiction, and as many as 93% of those labeled Neutral. Combining an LSTM model with the pre-trained PhoBERT model therefore improves accuracy considerably over the KieuNLI model from the previous experiment. Table 5.2 lists the accuracy of the KieuNLI and KieuNLI_PhoBERT models on the training and test sets of the “Kiều” data.

Table 5.2: Accuracy of the two models on the training and test sets of the “Kiều” data
Model              Train (% acc)   Test (% acc)
KieuNLI            94.8            60.4
KieuNLI_PhoBERT    83.2            75.6

CHAPTER 6. CONCLUSION AND FUTURE WORK

In the present era, information is regarded as a valuable asset and an indispensable part of every aspect of social activity. NLI in particular, and NLP in general, play an increasingly important role and have contributed many valuable results to applications such as search, information summarization, spell checking, and machine translation. In the course of this research, the thesis achieved a number of results and identified several directions for further development.

6.1 Results achieved

This thesis provides an overview of the theoretical foundations of natural language processing and natural language inference, of related problems such as Vietnamese word segmentation, and of approaches to NLI and to language representation learning. On this basis, the thesis surveys a range of related scientific work and analyzes its effectiveness as well as the limitations that remain unresolved, with particular attention to the SNLI dataset.

The thesis studies and describes the Transformer and BERT deep learning models and the variants of BERT. These are the core algorithms and models used to build a Vietnamese natural language inference model that predicts the relationship between a premise sentence and a hypothesis sentence.

The thesis successfully built the “Kiều” dataset, whose premise set consists of 1627 pairs of lục bát verses from Nguyễn Du's “Truyện Kiều” and whose hypothesis set contains 4881 corresponding hypothesis sentences. Validation was carried out on 1020 sentence pairs, about 20% of the “Kiều” data. Although the dataset is limited, it has the advantage of reaching acceptable accuracy at a small scale, which opens a research direction that depends less heavily on data. In addition, the fact that the models learn well on this dataset suggests that the wording of “Truyện Kiều” covers a large part of Vietnamese semantics.

The thesis built and completed two Vietnamese natural language inference models, KieuNLI and KieuNLI_PhoBERT. It then trained both models on the dataset, reported the experimental results, and compared the two models.

6.2 Future work

The author hopes to continue developing the two Vietnamese inference models by tuning the training parameters and combining hidden layers more effectively so that the models give better results, and to continue this research at the doctoral level.

A further goal is to collect and validate more sources of input data in order to test the objectivity of the models and to ensure that they can grasp the true meaning of Vietnamese sentences, as shown by accurate prediction of the relationship between premise and hypothesis pairs. The models could then be applied to tasks ranging from information retrieval and technical analysis to commonsense reasoning.

REFERENCES

[1] Q. Đ. Phạm, Bói Kiều Như Một Nét Văn Hóa. NXB Văn Hóa Sài Gòn, 2004.
[2] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, "A large annotated corpus for learning natural language inference," arXiv preprint arXiv:1508.05326, 2015.
[3] B. MacCartney, Natural language inference. Stanford University, 2009.
[4] C. N. Mai, N. Đ. Vũ, and P. T. Hoàng, Cơ sở ngôn ngữ học tiếng Việt (in Vietnamese), 2003.
[5] StreetCodeVn. "Cách tách từ cho Tiếng Việt."
https://streetcodevn.com/blog/vntok (accessed 18, 2022).
[6] S. Meknavin, P. Charoenpornsawat, and B. Kijsirikul, "Feature-based Thai word segmentation," in Proceedings of Natural Language Processing Pacific Rim Symposium, 1997, vol. 97: Citeseer, pp. 41-46.
[7] L. A. Ramshaw and M. P. Marcus, "Text chunking using transformation-based learning," in Natural language processing using very large corpora: Springer, 1999, pp. 157-176.
[8] M. Mohri, F. Pereira, and M. Riley, "Weighted finite-state transducers in speech recognition," Computer Speech & Language, vol. 16, no. 1, pp. 69-88, 2002.
[9] A. T. Vu, D. Q. Nguyen, D. Q. Nguyen, M. Dras, and M. Johnson, "VnCoreNLP: Vietnamese natural language processing toolkit," arXiv preprint arXiv:1801.01331, 2018.
[10] T. Trần. "Nguyễn Du Truyện Kiều giá trị trường tồn vượt không gian thời gian." 2020. http://hatinh.edu.vn/phong-truyen-thong/nguyen-du-va-truyenkieu-gia-tri-truong-ton-vuot-khong-gian-.html (accessed 21, 2022).
[11] M. Rato, "Filial Piety and Chastity in Nguyen du’s The Tale of Kieu," Manusya: Journal of Humanities, vol. 10, no. 4, pp. 66-75, 2007.
[12] I. Dagan, O. Glickman, and B. Magnini, "The pascal recognising textual entailment challenge," in Machine learning challenges workshop, 2005: Springer, pp. 177-190.
[13] E. Marsi and E. Krahmer, "Classification of semantic relations by humans and machines," in Proceedings of the ACL workshop on Empirical Modeling of Semantic Equivalence and Entailment, 2005, pp. 1-6.
[14] M.-C. De Marneffe, B. MacCartney, and C. D. Manning, "Generating typed dependency parses from phrase structure parses," in Lrec, 2006, vol. 6, pp. 449-454.
[15] S. Harabagiu and A. Hickl, "Methods for using textual entailment in open-domain question answering," in Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, 2006, pp. 905-912.
[16] A. Hickl and J. Bensley, "A discourse commitment-based framework for recognizing textual entailment," in Proceedings
of the ACL-PASCAL workshop on textual entailment and paraphrasing, 2007, pp. 171-176.
[17] B. MacCartney and C. D. Manning, "Modeling semantic containment and exclusion in natural language inference," in Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), 2008, pp. 521-528.
[18] M. Marco, B. Luisa, B. Raffaella, M. Stefano, and Z. Roberto, "SemEval-2014 Task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment," in Proc. SemEval, 2014, pp. 1-8.
[19] M. Marelli, S. Menini, M. Baroni, L. Bentivogli, R. Bernardi, and R. Zamparelli, "A SICK cure for the evaluation of compositional distributional semantic models," in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), 2014, pp. 216-223.
[20] A. Vaswani et al., "Attention is all you need," Advances in neural information processing systems, vol. 30, 2017.
[21] Q. Phạm. "Tìm hiểu mô hình Transformer - Ngươi Không Phải Là Anh Hùng, Ngươi Là Quái Vật Nhiều Đầu." https://pbcquoc.github.io/transformer/ (accessed 21, 2022).
[22] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[23] Q. H. Phạm. "Hiểu hơn về BERT: Bước nhảy lớn của Google."
https://viblo.asia/p/hieu-hon-ve-bert-buoc-nhay-lon-cua-google-eW65GANOZDO (accessed 21, 2022).
[24] D. Q. Nguyen and A. T. Nguyen, "PhoBERT: Pre-trained language models for Vietnamese," arXiv preprint arXiv:2003.00744, 2020.
[25] Y. Li et al., "Association rule-based feature mining for automated fault diagnosis of rolling bearing," Shock and Vibration, vol. 2019, 2019.
[26] D. Q. Nguyen, D. Q. Nguyen, T. Vu, M. Dras, and M. Johnson, "A fast and accurate Vietnamese word segmenter," arXiv preprint arXiv:1709.06307, 2017.
[27] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," arXiv preprint arXiv:1508.07909, 2015.
[28] H. V. Lê, Truyện Kiều chú giải. NXB Ziên Hồng, 1956.
[29] J. L. Fleiss, "Measuring nominal scale agreement among many raters," Psychological bulletin, vol. 76, no. 5, p. 378, 1971.
[30] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[31] I. Sutskever, J. Martens, and G. E. Hinton, "Generating text with recurrent neural networks," in ICML, 2011.
[32] O. Vinyals, S. V. Ravuri, and D. Povey, "Revisiting recurrent neural networks for robust ASR," in 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), 2012: IEEE, pp. 4085-4088.

APPENDIX

Natural language inference model for the Vietnamese language with machine learning algorithms: a view from “Truyện Kiều”

Tran Thi Mai Trang
Information Technology Faculty
HCMC University of Technology and Education
Ho Chi Minh, Viet Nam
1981310@student.hcmute.edu.vn

Huynh Xuan Phung*
Information Technology Faculty
HCMC University of Technology and Education
Ho Chi Minh, Viet Nam
phunghx@hcmute.edu.vn

Abstract—Natural language inference models predict whether a hypothesis sentence can be inferred from a premise sentence. Understanding the inference properties of natural language can improve the representation of semantic context. Many studies in this field address English, but few address Vietnamese. We therefore introduce a natural language inference corpus in Vietnamese built from the Tale of Kieu, “Truyện Kiều”, written by Nguyễn Du. The dataset takes its premises from the verses of “Truyện Kiều”; its hypotheses are conclusions gathered from “Kieu fortune-telling”. We make the following contributions: (i) a corpus for applying machine learning to semantic representation in the Vietnamese language; (ii) natural language inference models that use machine learning to predict the relationship between premise and hypothesis sentences. Finally, our research helps preserve the traditional beauty of “Truyện Kiều” in general and the culture of “Kieu fortune-telling” in particular.

Index Terms—natural language inference, textual entailment, Truyện Kiều, Vietnamese corpus, deep learning

I. Introduction

A. Natural language inference

Since the dawn of artificial intelligence (AI), natural language processing (NLP) has been a fascinating research topic. NLP is a set of theoretically motivated computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis, with the aim of building computer systems and programs that can communicate with people in natural language rather than in programming or machine languages. Understanding a language is one of the hardest problems in bridging the gap between humans and computers, because computers tend to grasp things only in the literal sense of the words. To address this problem, NLP models use a training data preprocessing mechanism that includes sentence-level tasks such as natural language inference and interpretation.

Natural language inference (NLI) studies whether a natural language hypothesis (h) can be legitimately deduced from a natural language premise (p) [1]. NLI, also known as textual entailment recognition, is used in various applications such as information retrieval, translation, automatic text summarization, and automatic spell checking.
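As a minimal sketch of the task's input and output (an illustration only, not the authors' code), a labeled NLI instance can be represented as a premise, a hypothesis, and one of the three standard labels. The names `Label` and `NLIExample` are hypothetical; the sentences are the English glosses of the example that this paper gives in Section I-C.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    """The three NLI relations between a premise and a hypothesis."""
    ENTAILMENT = "entailment"
    CONTRADICTION = "contradiction"
    NEUTRAL = "neutral"


@dataclass(frozen=True)
class NLIExample:
    """One labeled premise-hypothesis pair (hypothetical container)."""
    premise: str
    hypothesis: str
    label: Label


# English gloss of the example verse pair quoted in Section I-C.
PREMISE = (
    "In a rush, she lowered the door's silk curtain, "
    "And hastened her pace towards the midnight garden."
)

# One hypothesis per label for the same premise.
EXAMPLES = [
    NLIExample(PREMISE, "You are agile, brave, and decisive.", Label.ENTAILMENT),
    NLIExample(PREMISE, "You are slow and cowardly.", Label.CONTRADICTION),
    NLIExample(PREMISE, "Violets are blue.", Label.NEUTRAL),
]

for example in EXAMPLES:
    print(f"{example.label.value:>13}: {example.hypothesis}")
```

An NLI model is then a function from `(premise, hypothesis)` to a `Label`; keeping the label as an enum rather than a free string makes the three-way output space explicit.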
The logical relationship between two text strings is determined by NLI and usually falls into one of three categories:
• Entailment: the hypothesis can be inferred from the premise.
• Contradiction: the hypothesis cannot be inferred from the premise.
• Neutral: all other cases.

The main challenge of NLI is that the emphasis is on informal reasoning, lexical-semantic knowledge, and variability of linguistic expression, rather than on long chains of formal reasoning [1]. This is especially true for Vietnamese, one of the most challenging languages in the world.

B. The Tale of Kieu by Nguyễn Du

Nguyễn Du (1765 - 1820) was born in Thang Long (today Hanoi) to a noble and illustrious family that had produced many generations of mandarins and had a tradition in literature. Nguyễn Du's life was deeply bound up with the historical events of the late 18th and early 19th centuries. “Truyện Kiều” (The Tale of Kieu) was inspired by the story “Kim Vân Kiều” (Jīn Yún Qiào) by the Chinese author Qīngxīn Cáirén (Vietnamese: Thanh Tâm Tài Nhân). Nguyễn Du borrowed the Chinese social context of the Ming Dynasty to paint a panorama of life in the era in which the poet himself was living. “Truyện Kiều” was written in the Vietnamese “six-eight” verse, the folk meter known as “lục bát”, a popular form of poetry that can be understood by everyone, from farmers to learned people.

The story consists of 3,254 verses about the 15 years of wandering of Thuý Kiều, a talented and accomplished daughter who had to sell herself to redeem her father. From then on, catastrophe after catastrophe fell on her. She had to follow her fate: she was cheated, twice held in a pleasure house (whorehouse), made a singer, concubine, and servant, and trampled by feudal forces. Nguyễn Du, through his characterization of Kieu, knotted the tie of love with filial duty and revealed the ambiguity of the Confucian concept of female chastity [2]. The topics of love, Confucian morality, and social obligation drawn from The Tale of
Kieu in early 20th-century public debate provided a realm for the liberation of ideas for the new generation of Vietnamese intellectuals, who cast off the limitations of Confucian ideology [2].

In terms of artistic value, Nguyễn Du skillfully combined the quintessence of scholarly language with that of the popular vernacular. In “Truyện Kiều”, the Vietnamese language and ‘lục bát’ verse reached their pinnacle. The work is also the crystallization of the national literature's achievements in language and genre, drawing on a broad and overlapping knowledge of languages, idioms, proverbs, historical references, and names dating back more than 200 years. Its poetic language is vibrant, expressive, and musical, able to depict landscapes, portraits, states of mind, feelings, and emotions in only a few verses that reverberate throughout the Vietnamese heart and spirit.

Through “Truyện Kiều”, the author exposed the face of an unjust and violent feudal society while also reflecting the misery and unhappiness of individuals, particularly women, in the Vietnamese society of the time. “Truyện Kiều” is also a voice that promotes the love of freedom, the yearning for justice, and human beauty. More than 35 translations into more than 20 languages demonstrate the book's widespread impact and lasting worth beyond Vietnam's borders.

C. Research objectives

For hundreds of years, “Truyện Kiều” has circulated widely thanks to its many great values, and it has won over every class of reader, from intellectuals to ordinary people, across generations of Vietnamese. Out of this devotion grew a belief, “Kieu fortune-telling”, the practice of telling fortunes from the text of “Truyện Kiều”: “Lạy vua Từ Hải, lạy vãi Giác Duyên, lạy tiên Thúy Kiều, xin cho ba dòng.” (“Dear King Tu Hai, Dear Monk Giac Duyen, Dear Fairy Thuý Kiều, please give me three lines.”) “Kieu fortune-telling” is based on random pairs of verses in “Truyện Kiều” that the reader
interprets, with one line being the cause and the other the effect. The cause is a piece of counsel for mortals on how to behave and practice; the effect may be a result that has already occurred or a prediction of what will occur in the future. This is not blind faith or superstition but rather a fulcrum of faith for those suffering terrible circumstances. From these pairs of verses, which we call premises, the reader draws the matching hypotheses about his or her fate.

Example with a pair of premise verses: “Cửa ngoài vội rủ rèm the, Xăm xăm băng lối vườn khuya một mình.” (“In a rush, she lowered the door’s silk curtain, And hastened her pace towards the midnight garden.”) The reader may form the following hypotheses and relationships:
• Bạn là người nhanh nhẹn, dũng cảm, quyết đoán (You are agile, brave, and decisive.) - entailment
• Bạn là người chậm chạp, nhát gan (You are slow and cowardly.) - contradiction
• Hoa violet màu xanh (Violets are blue.) - neutral

This paper constructs a data set from the above content. Its premises are pairs of verses from “Truyện Kiều”, and its hypotheses are conclusions derived from “Kieu fortune-telling”. Using machine learning algorithms, we then develop an inference model and test it on the resulting Kieu corpus. Furthermore, this study contributes to preserving the cultural heritage of Nguyễn Du's “The Tale of Kieu” in general and the culture of “Kieu fortune-telling” in particular.

D. The problem of word segmentation in Vietnamese

A system that cannot detect the meaning of a sentence cannot claim to understand the sentence: the NLI task is a stringent test of a system's language processing ability. To get at that meaning, the system must first determine which words a sentence contains, based on its structure. A word is the smallest meaningful unit of language that is used independently and freely reproduced in speech to build sentences [3]. However, the basic unit of word formation in Vietnamese is the syllable, and a word in
Vietnamese can consist of one or more syllables. Spaces do not define word boundaries; thus, there are many ways to divide or combine syllables into words, which causes ambiguity. For example, the word “mặt trời” (sun) is made up of two syllables, “mặt” (face) and “trời” (heaven). Each syllable has its own meaning when it stands alone, but combined they take on a different meaning. Likewise, “nhà hàng” (restaurant) is made up of the two syllables “nhà” (house) and “hàng” (product). More specifically, about 85% of Vietnamese word types are composed of at least two syllables, and 80% of syllable types are words by themselves [4], [5].

The task of resolving this ambiguity is called word segmentation. Its purpose is to determine the boundaries of the words in a sentence, or in other words, to group adjacent syllables into meaningful words. For example, the sentence below is shown with a gloss in parentheses after each syllable:

“Đầu (head) lòng (entrails) hai (two) ả (she) tố (accuse) nga (Russia), Thúy Kiều là (is) chị (elder), em (younger) là (is) Thúy Vân.”

After word segmentation it becomes:

“Đầu_lòng (first) hai (two) ả (daughters) tố_nga (magnificent), Thuý_Kiều là (is) chị (elder), em (younger) là (is) Thuý_Vân.”

Formally, the syllables of a multi-syllable word are joined with the underscore character “_”, so that after segmentation a space separates each word (token) in the sentence.

The goal of segmenting the input text is to eliminate its semantic ambiguity. For Vietnamese, where spaces do not define the boundaries between words, the semantics of a sentence depend heavily on word order and word separation [6]. Many practical approaches to word segmentation have been studied [7]. They can be classified as either dictionary-based or statistical methods, while many state-of-the-art systems use hybrid approaches [8]. For this paper, we use VnCoreNLP, a toolkit that supports key NLP tasks, including word
segmentation, part-of-speech (POS) tagging, named entity recognition (NER), and dependency parsing, and obtains state-of-the-art (SOTA) results for these tasks [9]. VnCoreNLP is easy to use, fast, and accurate [9].

II. Related scientific research works

Understanding entailment and contradiction is fundamental to understanding natural language, and inference about entailment and contradiction is a valuable testing ground for developing semantic representations [10]. However, machine learning research in this area has been dramatically limited by the lack of large-scale resources [10].

The primary sources of annotated NLI corpora have been the Recognizing Textual Entailment (RTE) challenge tasks [11]. The RTE-1 dataset consisted of manually collected text pairs, each made up of a text (t) of one or two sentences and a hypothesis (h) of one sentence. Participating systems were asked to decide, for each pair, whether t entails h [11]. Following RTE-1, a second competition (RTE-2) was created to support the continued study of textual entailment. Its main focus was on generating a dataset with more “realistic” text-hypothesis examples, based primarily on the outputs of real systems [12]. However, its effectiveness was limited by the problem of representing semantic structure.

Stanford RTE, developed over several years by a large team including Bill MacCartney, Christopher D. Manning, and many others, was one of the first NLI systems to separate alignment clearly from entailment determination [1], [13]. This system used typed dependency trees as a proxy for semantic structure and sought a low-cost alignment between the trees for p and h, using a cost model that incorporated both lexical and structural matching costs [1]. Stanford RTE was typical of NLI approaches based on approximate graph matching, and it achieved significantly higher accuracy than simple models such as bag of words.

NatLog was the first robust, general-purpose system for natural logic inference over
actual English sentences [14]. The NatLog system decomposed an inference problem into a sequence of atomic edits that transform p into h; predicted a lexical entailment relation for each edit using a statistical classifier; propagated these relations upward through a syntax tree according to the semantic properties of intermediate nodes; and joined the resulting entailment relations across the edit sequence [1]. Adding NatLog as a component of the Stanford RTE system resulted in a significant increase in accuracy [14].

The resources mentioned above have two limitations. First, their small size (each has fewer than a thousand samples) limits their utility as a testbed for learned distributed representations. Second, they suffer from a subtler issue that affects even projects using only human-provided annotations: the indeterminacies of event and entity coreference lead to insurmountable indeterminacy concerning the correct semantic label [15], [16]. For example, the pair of sentences “Landmark is the tallest building in Ho Chi Minh City” and “Landmark is the tallest building in Saigon” could be labeled as entailment if one assumes that the two place names refer to the same entity, but it could also reasonably be considered neutral if that assumption is not made.

As a result, several evaluation datasets have been introduced, most recently the Stanford Natural Language Inference (SNLI) corpus, whose authors sought to address the issues of size, quality, and indeterminacy. SNLI is a collection of 570K sentence pairs labeled for entailment, contradiction, and semantic independence, written by humans performing a novel grounded task based on image captioning [10]. To do this, the researchers used crowd-sourcing with several critical new features: examples were based on specific situations (drawn from image captions), so the premise and hypothesis described the situation from the same perspective; the prompt gave participants the freedom to create entirely new sentences, resulting
in richer examples without sacrificing consistency; and a set of highly reliable annotations was provided on the same data, which also identified the regions of uncertain inference [10]. To measure the quality of the corpus and to create the most useful test and development sets, the SNLI authors performed additional validation on about 10% of the data. The SNLI corpus is of high quality, helps develop models for NLI tasks, and is suitable for training parameter-rich models such as neural networks. SNLI can significantly improve performance on a standard challenge dataset, offering the hope of serving both as valuable training data and as a challenging testbed for the continued application of machine learning to semantic representation [10].

III. Kieu corpus

There have been many studies of NLI models and many high-quality training datasets, such as SNLI, in English, but little research in Vietnamese. We therefore build the “Kieu” dataset following the design of the SNLI corpus, and then build inference models with machine learning algorithms and evaluate them on this dataset. Written as a Nôm poem of 3,254 verses in lục bát (six-eight) meter, “Truyện Kiều” is a collection of words and phrases with provisional meaning; it represents Nguyễn Du's linguistic creativity and is an inevitable product of history. With an NLI corpus for Vietnamese built on “Truyện Kiều”, we try to address the problems of quality and indeterminacy: because each verse pair in the poem presents a particular scenario, the premise and hypothesis sentences are constrained to describe that scenario from the same angle, which helps control event and entity coreference. In addition, a subset of the “Kieu” dataset was sent to a validation task that provided a set of highly reliable annotations on the same data and identified areas of potentially uncertain inference.

A. Data collection

The data set includes a set of premises consisting of 1,627 pairs of lục bát verses from “Truyện Kiều”,
and a set of hypotheses about fate and human life, collected from “Kieu fortune-telling” and created by the author based on “Truyện Kiều chú giải” (Truyện Kiều Annotated) by Lê Văn Hoè. Each premise has corresponding hypotheses for each of the three labels: entailment, contradiction, and neutral. Table I shows a set of randomly selected examples from the Kieu corpus, each with a premise and its hypotheses for all three labels. After construction, the dataset is normalized by checking for spelling and punctuation errors. We use this dataset to evaluate various models for NLI, including rule-based systems, simple linear classifiers, and neural network-based models [10].

Table II reports some key statistics about the collected corpus, and the accompanying figure shows the distribution of sentence lengths (in tokens) for both the hypotheses and the premises. Because the premises are pairs of lục bát lines from “Truyện Kiều”, their length is about thirteen words. The hypothesis sentences must interpret a good deal of information from the verse in order to support a clear judgment, yet they still tend to be as short as possible, about eight words in length.

B. Data validation

To measure the corpus's quality and create the most useful test and development sets, the Kieu corpus was additionally validated on about 20% of the data. This validation phase followed the same basic form as the SNLI validation process. We sent each pair of premise verses with its corresponding hypothesis sentences to four reviewers and asked them to choose a single label for each pair from four options: the three original labels and one marker label for problematic pairs. The data validation instructions are shown in the corresponding figure and were linked to a FAQ. If at least three reviewers chose the same label, that label became the “gold” label; otherwise, the example was labeled “-”. Such unlabeled examples are unlikely to be helpful for the standard NLI classification task, so we edit the content of these pairs of
sentences and put them back into the training set in the experiments discussed in this paper.

Fig. Distribution of sentence length for the 1,627 pairs of premise verses and the 4,881 corresponding hypotheses in the Kieu corpus.

The results of this validation process are summarized in Table III. Almost all of the examples received a majority label, indicating broad consensus on the nature of the data and labels. The gold-labeled examples are distributed roughly evenly across the three labels. The Fleiss κ scores (calculated on the validated examples) are nearly identical across the three labels, at about 0.845, and the overall agreement is high, indicating that the Kieu corpus is of sufficiently high quality to pose a challenging but realistic machine learning task.

The Kieu corpus has three splits: train, validation, and test. All validation and test samples come from the validated hypothesis set, with the no-consensus hypotheses removed. Each premise appears in only one split.

TABLE I
Randomly chosen examples from the Kieu corpus, each with a premise and the corresponding hypotheses for all three labels

Premise: Nhặt thưa gương gối đầu cành, Ngọn đèn trông lọt trướng huỳnh hắt hiu
  Entailment: Tâm trạng cô đơn, buồn chán
  Contradiction: Tâm trạng vui vẻ, hạnh phúc
  Neutral: Giá vàng ngày tăng cao

Premise: Sinh vừa tựa án thiu thiu, Dở chiều như tỉnh dở chiều như mê
  Entailment: Bạn mệt mỏi, nghỉ ngơi
  Contradiction: Bạn tràn đầy sức sống
  Neutral: Giá xăng vừa tăng 3000 VNĐ

Premise: Lạ gì bỉ sắc tư phong, Trời xanh quen thói má hồng đánh ghen
  Entailment: Được điều bị điều kia, bị ghen ghét
  Contradiction: Đạt điều mong muốn
  Neutral: Bạn là người con gái đẹp

Premise: Cảo thơm lần giở trước đèn, Phong tình cổ lục còn truyền sử xanh
  Entailment: Có chuyện tình đẹp
  Contradiction: Chuyện tình cảm phát triển không thuận lợi
  Neutral: Món sủi cảo ăn ngon

Premise: Rằng năm Gia Tĩnh triều Minh, Bốn phương phẳng lặng, hai kinh vững vàng
  Entailment: Cuộc sống trông bình yên, vững vàng, đất nước thái bình
  Contradiction: Sắp có xáo trộn sống
  Neutral: Được du lịch

TABLE II
Key statistics for the raw sentence pairs in “Truyện Kiều”

Data set sizes:
  Training pairs                   3,909
  Development pairs                486
  Test pairs                       486
Sentence length:
  Premise mean token count         12.9
  Hypothesis mean token count      9.7

TABLE III
Statistics for the validated pairs. The author's label is the label assigned by the author who wrote the hypotheses when creating the sentence pair. A gold label reflects a consensus of three votes among the author and the four annotators.

General:
  Validated pairs                          1,020
  Pairs w/ unanimous gold label            76.7%
Individual annotator label agreement:
  Individual label = gold label            94.1%
  Individual label = author's label        93.1%
Gold label/author's label agreement:
  Gold label = author's label              98%
  Gold label ≠ author's label              1.7%
  No gold label (no labels match)          0.3%
Fleiss κ:
  contradiction                            0.8456
  entailment                               0.8452
  neutral                                  0.8451
  Overall                                  0.8453

IV. Natural language inference models for Vietnamese

A. KieuNLI model

The goal is an inference model that, when deployed in practice, lets users enter any premise and hypothesis and outputs the relationship between the two sentences: Entailment, Contradiction, or Neutral. As a first step, this paper builds a Vietnamese NLI model named KieuNLI, based on the SNLI model [10]. The input consists of a premise sentence and a hypothesis sentence, each with a maximum length of 300 words after being encoded by PhoBERT. Each premise is a short pair of verse lines (about thirteen words, as noted in Section III), so a maximum premise length of 300 words prevents any data loss. Our surveys show that real hypothesis sentences can be 100 to 200 words long, so the hypothesis input is also set to 300 words, which likewise keeps the model from truncating its input. After the input layers, the KieuNLI model concatenates the two inputs and passes them through three 300D ReLU layers. We chose three ReLU layers because they performed well in our experiments. The last layer of the model is the softmax
layer, whose output is the predicted probability of each label. The figure captioned “The architecture of KieuNLI” visualizes the paper's proposed model. In summary, the KieuNLI model can be described as follows:
• The premise and the hypothesis are each represented by a 300D sum-of-words model, and the two representations are combined.
• The premise and hypothesis sentences are encoded using the same set of encoders (phoBERT and BERT). PhoBERT is the first large-scale monolingual pretrained model for Vietnamese [17].
• The neural network classifier is simply a stack of three 300D ReLU layers.
• The top layer feeds a softmax classifier whose outputs are the labels entailment, contradiction, and neutral.

Fig. The architecture of KieuNLI.

Fig. Instructions for data validation of the Kieu corpus in Vietnamese and English.

B. Experiments

As described above, producing a Vietnamese dataset is more complicated than developing one for English. The text first goes through a preprocessing stage that normalizes it and removes noisy or insignificant information, such as punctuation marks and special characters like HTML and JavaScript tags. Then, because Vietnamese does not mark word boundaries clearly with spaces, the word segmentation problem must be solved; this is a critical issue that affects the program's accuracy. In addition, before encoding the input and feeding it into the training process, this project filters terms that appear frequently in the dataset in order to identify and delete meaningless words. Next, from this raw data, we use BPE (Byte Pair Encoding) [18] to index all words, including open-vocabulary words (those not appearing in the dictionary), by encoding each word as a sequence of subword units. This limits the number of tokens needed to represent words that have never appeared before. Then, we map the data files together to create lists containing the separated premise and hypothesis data, with their labels encoded as 0, 1, 2, corresponding to the labels (entailment,
contradiction, neutral). Finally, we train the model by passing the word vectors through the ReLU layers and predicting the label with the highest probability.

The results of the KieuNLI model are shown in the confusion matrix figure. The KieuNLI model correctly predicts entailment pairs in 79 percent of cases, contradiction pairs in 41 percent of cases, and neutral pairs in 73 percent of cases. Many sentence pairs are still predicted incorrectly. Because the model only partially understands the meaning of the phrases, it still confuses labels often, particularly entailment and contradiction.

Fig. Confusion matrix of the Kieu corpus on the KieuNLI model for Vietnamese.

First, Vietnamese has many words whose meaning is ambiguous, as well as many homonyms, synonyms, and dialect forms. This is especially true of “Truyện Kiều”, a work that densely mixes knowledge from several different languages with idioms, proverbs, historical allusions, and names that date back more than 200 years. Consider the following premise:

“Lạ_gì (no wonder) bỉ_sắc_tư_phong (unlucky in other ways), Trời_xanh (Blue Sky) quen_thói (habit) má_hồng (rosy cheeks) đánh_ghen (jealousy)”
(“No wonder: Fate never favours one twice, Rosy cheeks must face the hatred from the Blue Sky.”)

The idiom “bỉ_sắc_tư_phong”, which is made up of four Chinese characters, is used in this verse pair to signify that if the Creator gives more of one thing, he gives a little less of another, never giving anyone enough. No one can achieve everything they want. Similarly, the model can misread the word “đánh_ghen” as the action of hitting, even though it refers to jealousy. Therefore, the relationship of this premise to the hypothesis “You are a beautiful woman” must be labeled neutral.

Additionally, the model may struggle to interpret the meaning of phrases due to a paucity of training data,
suggesting that paying closer attention to compositional semantics might be beneficial. However, many persistent errors go deeper, to inferences that depend on general knowledge and context-specific assumptions. This highlights the need for a model with direct access to syntactic or semantic structure. Future research can improve the KieuNLI model and produce better results, for instance by using LSTM modeling or phoBERT pretraining.

V. Conclusion

This paper sought to overcome the limitations of NLI in Vietnamese with a new, natural corpus of labeled sentence pairs. The Kieu corpus can be used to improve performance on the standard challenge dataset significantly. We also provided an inference model that uses machine learning algorithms to evaluate models on the dataset we built. In addition, the paper contributes to preserving the beauty of the culture around Nguyễn Du's “Truyện Kiều” in general, and the culture of “Kieu fortune-telling” in particular.

References

[1] B. MacCartney, Natural language inference. Stanford University, 2009.
[2] M. Rato, “Filial piety and chastity in Nguyen Du's The Tale of Kieu,” Manusya: Journal of Humanities, vol. 10, no. 4, pp. 66–75, 2007. [Online]. Available: https://brill.com/view/journals/mnya/10/4/article-p66_5.xml
[3] N. C. Mai, D. N. Vu, and T. P. Hoang, Co so ngon ngu va tieng Viet. Giao duc, 2008.
[4] Q. T. Dinh, H. P. Le, T. M. H. Nguyen, C. T. Nguyen, M. Rossignol, and X. L. Vu, “Word segmentation of Vietnamese texts: a comparison of approaches,” in 6th International Conference on Language Resources and Evaluation - LREC 2008, 2008.
[5] N. T. M. Huyen, A. Roussanaly, H. T. Vinh et al., “A hybrid approach to word segmentation of Vietnamese texts,” in International Conference on Language and Automata Theory and Applications. Springer, 2008, pp. 240–249.
[6] N. K. Pham, N. M. T. Tran, T. P. Pham, and T. N. Do, “Su anh huong cua phuong phap tach tu bai toan phan lop van ban tieng viet,” in Ky yeu Hoi nghi Khoa hoc Quoc gia lan thu IX - Fundamental
And Applied IT Research (FAIR'9), 2017.
[7] Q. T. Dinh, H. P. Le, T. M. H. Nguyen, C. T. Nguyen, M. Rossignol, and X. L. Vu, “Word segmentation of Vietnamese texts: a comparison of approaches,” in 6th International Conference on Language Resources and Evaluation - LREC 2008. Marrakech, Morocco: ELRA - European Language Resources Association, May 2008. [Online]. Available: https://hal.inria.fr/inria-00334760
[8] J. Gao, M. Li, C.-N. Huang, and A. Wu, “Chinese word segmentation and named entity recognition: A pragmatic approach,” Computational Linguistics, vol. 31, no. 4, pp. 531–574, 2005.
[9] T. Vu, D. Q. Nguyen, D. Q. Nguyen, M. Dras, and M. Johnson, “VnCoreNLP: A Vietnamese natural language processing toolkit,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations. Association for Computational Linguistics, 2018. [Online]. Available: https://doi.org/10.18653%2Fv1%2Fn18-5012
[10] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, “A large annotated corpus for learning natural language inference,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2015.
[11] I. Dagan, O. Glickman, and B. Magnini, “The PASCAL recognising textual entailment challenge,” in Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 177–190.
[12] R. B. Haim, I. Dagan, B. Dolan, L. Ferro, D. Giampiccolo, B. Magnini, and I. Szpektor, “The second PASCAL recognising textual entailment challenge,” in Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, 2006.
[13] M.-C. De Marneffe, B. MacCartney, C. D. Manning et al., “Generating typed dependency parses from phrase structure parses,” in LREC, vol. 6, 2006, pp. 449–454.
[14] B. MacCartney and C. D. Manning,
“Modeling semantic containment and exclusion in natural language inference,” in Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), 2008, pp. 521–528.
[15] M.-C. De Marneffe, A. N. Rafferty, and C. D. Manning, “Finding contradictions in text,” in Proceedings of ACL-08: HLT, 2008, pp. 1039–1047.
[16] M. Marelli, S. Menini, M. Baroni, L. Bentivogli, R. Bernardi, R. Zamparelli et al., “A SICK cure for the evaluation of compositional distributional semantic models,” in LREC. Reykjavik, May 2014, pp. 216–223.
[17] D. Q. Nguyen and A. T. Nguyen, “PhoBERT: Pre-trained language models for Vietnamese,” arXiv preprint arXiv:2003.00744, 2020. [Online]. Available: https://arxiv.org/abs/2003.00744
[18] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” arXiv preprint arXiv:1508.07909, 2015.
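
Note on the agreement statistic in Table III. The Fleiss κ values reported there measure how far the five labels collected for each pair (the author's label plus four annotators) agree beyond chance. The following is a minimal sketch of the standard overall Fleiss κ computation, not the script used for the paper. The function name `fleiss_kappa`, the toy ratings, and the short label strings are assumptions made for illustration, and the per-label κ values of Table III are not computed here.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Overall Fleiss' kappa.

    ratings: list of items, each a list of labels (one per rater).
    Every item must carry the same number of labels.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two raters are required")
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    counts = [Counter(item) for item in ratings]

    # Share of all label assignments that went to each category.
    p_cat = [
        sum(c[cat] for c in counts) / (n_items * n_raters)
        for cat in categories
    ]

    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]

    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data (hypothetical, NOT taken from the Kieu corpus):
# four sentence pairs, five labels each.
toy_ratings = [
    ["E", "E", "E", "E", "E"],
    ["C", "C", "C", "C", "E"],
    ["N", "N", "N", "N", "N"],
    ["N", "N", "N", "C", "C"],
]
print(round(fleiss_kappa(toy_ratings, ["E", "C", "N"]), 4))  # prints 0.6212
```

A value near 1 indicates almost perfect agreement, and a value near 0 indicates agreement no better than chance.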

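
Note on the classifier in Section IV-A. The KieuNLI classifier is described as a concatenation of the premise and hypothesis representations, followed by three 300D ReLU layers and a softmax over the three labels. The sketch below illustrates only that forward pass in plain Python. It is an assumption-laden illustration, not the paper's implementation: the weights are random rather than trained, the phoBERT encoding step is replaced by random 300D vectors, and all function names (`build_classifier`, `predict`, and the helpers) are hypothetical.

```python
import math
import random

# Label order follows the 0, 1, 2 encoding described in Section IV-B.
LABELS = ("entailment", "contradiction", "neutral")


def dense(inputs, weights, biases):
    """Affine layer: one output value per weight row."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random (untrained) weights and zero biases for one layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_classifier(dim=300, hidden=300, seed=0):
    """Three ReLU layers on the concatenated input, then a 3-way output."""
    rng = random.Random(seed)
    sizes = [2 * dim, hidden, hidden, hidden]
    hidden_layers = [init_layer(sizes[i], sizes[i + 1], rng) for i in range(3)]
    output_layer = init_layer(hidden, len(LABELS), rng)
    return hidden_layers, output_layer


def predict(premise_vec, hypothesis_vec, classifier):
    """Return a probability for each label."""
    hidden_layers, output_layer = classifier
    x = list(premise_vec) + list(hypothesis_vec)  # concatenate both inputs
    if len(x) != len(hidden_layers[0][0][0]):
        raise ValueError("input size does not match the first layer")
    for weights, biases in hidden_layers:
        x = relu(dense(x, weights, biases))
    probs = softmax(dense(x, *output_layer))
    return dict(zip(LABELS, probs))


# Stand-in sentence vectors; a real system would obtain them from the encoder.
rng = random.Random(42)
premise = [rng.gauss(0.0, 1.0) for _ in range(300)]
hypothesis = [rng.gauss(0.0, 1.0) for _ in range(300)]
scores = predict(premise, hypothesis, build_classifier())
print(max(scores, key=scores.get), scores)
```

The predicted label is the one with the highest probability, as in the training description of Section IV-B. With untrained weights the output is close to uniform and carries no meaning.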