K-Fold Cross-Validation Method

+ Vietnamese

Accuracy (%)   ME        ME + TBL
Fold 1         94.3787   95.60548
Fold 2         94.2905   95.42824
Fold 3         94.2472   95.45703
Fold 4         94.3299   95.51637
Fold 5         94.2221   95.41178
Fold 6         94.3881   95.13798
Fold 7         94.3591   95.6928
Fold 8         94.2322   95.60124
Fold 9         94.3473   95.44316
Fold 10        94.5227   95.72672
Average        94.3318   95.50208

Table 4-9: Accuracy on Vietnamese with the K-Fold method

+ English

Accuracy (%)   ME            ME + TBL
Fold 1         97.2099       97.2561
Fold 2         97.47282849   97.57002739
Fold 3         97.34704      97.42482
Fold 4         97.08126      97.15224
Fold 5         97.14311      97.2058
Fold 6         97.29132      97.41524
Fold 7         97.27654      97.32119
Fold 8         97.34803      97.43413
Fold 9         97.38270308   97.4702381
Fold 10        97.30203      97.35459
Average        97.28548      97.36044

Table 4-10: Accuracy on English with the K-Fold method
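The fold averages reported for Vietnamese can be checked with a short sketch. The per-fold accuracies below are copied from the Vietnamese table; the function names `k_fold_splits` and `summarize`, and the choice of contiguous folds, are assumptions made for illustration and do not describe how the thesis tool partitions its corpus.

```python
from statistics import mean


def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Assumption: folds are contiguous blocks; the thesis does not state
    how sentences were assigned to folds.
    """
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


# Per-fold accuracies (%) for Vietnamese, copied from the table above.
VI_ME = [94.3787, 94.2905, 94.2472, 94.3299, 94.2221,
         94.3881, 94.3591, 94.2322, 94.3473, 94.5227]
VI_ME_TBL = [95.60548, 95.42824, 95.45703, 95.51637, 95.41178,
             95.13798, 95.6928, 95.60124, 95.44316, 95.72672]


def summarize(baseline, combined):
    """Return (baseline mean, combined mean, gain in percentage points)."""
    base_avg = mean(baseline)
    comb_avg = mean(combined)
    return base_avg, comb_avg, comb_avg - base_avg


if __name__ == "__main__":
    base_avg, comb_avg, gain = summarize(VI_ME, VI_ME_TBL)
    print(f"ME: {base_avg:.4f}  ME + TBL: {comb_avg:.5f}  gain: {gain:.4f}")
```

Each sentence index appears in exactly one test fold, so every sentence is used for evaluation once and for training in the other folds.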

We ran the baseline model and the combined model for part-of-speech tagging on two corpora, each with its own tag set, in the same experimental environment. The experiments lead to the following observations:

- The experimental results show that integrated approaches are promising for part-of-speech tagging, especially for languages whose corpora are not yet "complete", such as Vietnamese. Although training takes longer than for the baseline models, tagging quality improves noticeably.

- The strength of the combined model is that it brings together many rich features, in particular rare features that the baseline model does not handle. Although we did not have much time to build a sufficiently good feature set for part-of-speech tagging, the results obtained are noteworthy.

- The training time of both the baseline model and the combined model depends on the size of the training corpus and the test corpus.

CONCLUSION

Results achieved

In this thesis we propose using a combined model to solve disambiguation problems. For part-of-speech tagging, we combine the maximum entropy model with the transformation-based rule learning model. The resulting accuracy is about 95.50% for Vietnamese (a gain of about 1.18%) and 97.40% for English (a gain of about 0.12%) over the baseline model. Specifically, we:

- Studied the maximum entropy and transformation rule machine learning methods in the context of part-of-speech tagging, and proposed a set of 30 transformation rule templates for Vietnamese for use in the transformation rule learning model.

- Proposed a way of combining the maximum entropy model with the transformation rule learning model to solve the part-of-speech tagging problem, and developed and completed a Vietnamese part-of-speech tagging tool based on the combined model, implemented in Java.

- Ran experiments on two corpora, Penn TreeBank and Viet TreeBank. The results are better than those of the baseline model, which shows that combined approaches are promising for part-of-speech tagging, especially for languages whose corpora are not yet "complete", such as Vietnamese.

- Detected and corrected more than 400 exceptional errors in nearly 300 sentences of the corpus.

- Presented our research results in the paper "Improving Part-Of-Tagging using Maximum Entropy Models with Transformation Based Learning Models", and published a paper applying the proposed combined-model approach to word sense disambiguation, "Combining Statistical Machine Learning with Transformation Rule Learning for Vietnamese Word Sense Disambiguation", at the RIVF conference in 2012.
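The combination summarised in the results above, a statistical tagger whose output is then corrected by learned transformation rules, can be sketched as follows. This is a minimal illustration under stated assumptions: the baseline here is a toy dictionary lookup standing in for the maximum entropy tagger, and the single rule template ("change tag A to B when the previous tag is C") is a common template from the transformation-based learning literature, not necessarily one of the 30 templates proposed in the thesis. All names (`Rule`, `apply_rule`, `combined_tag`, `toy_baseline`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: change from_tag to to_tag when the previous tag is prev_tag."""
    from_tag: str
    to_tag: str
    prev_tag: str


def apply_rule(tags, rule):
    """Apply one transformation rule to a tag sequence.

    Matching is done against a snapshot of the input tags, so a rewrite
    made in this pass cannot trigger another rewrite in the same pass.
    """
    old = list(tags)
    new = list(tags)
    for i in range(1, len(old)):
        if old[i] == rule.from_tag and old[i - 1] == rule.prev_tag:
            new[i] = rule.to_tag
    return new


def combined_tag(words, baseline_tagger, rules):
    """Tag with the baseline model, then apply the rules in their learned order."""
    tags = baseline_tagger(words)
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags


# Toy stand-in for the statistical baseline (NOT a maximum entropy model):
# each word gets a fixed tag, unknown words default to NN.
TOY_LEXICON = {"to": "TO", "the": "DT", "race": "NN"}


def toy_baseline(words):
    return [TOY_LEXICON.get(w.lower(), "NN") for w in words]


if __name__ == "__main__":
    rules = [Rule(from_tag="NN", to_tag="VB", prev_tag="TO")]
    print(combined_tag(["to", "race"], toy_baseline, rules))
    print(combined_tag(["the", "race"], toy_baseline, rules))
```

The rule stage only rewrites tags the baseline has already assigned, which is why the baseline's accuracy sets the starting point and the rules account for the gain on top of it.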

Future work

- Continue researching and building a richer feature set for each language.

- Extend the Vietnamese part-of-speech dictionary with additional data in order to improve training time in the maximum entropy model.

- Study the application of the combined model to other problems in natural language processing.

- Study methods for balancing the corpus before training the system, in order to improve the quality of the statistical model obtained after training.

LIST OF PUBLICATIONS RELATED TO THE THESIS

[1] Phu-Hung Dinh, Ngoc-Khuong Nguyen, Anh-Cuong Le. "Combining Statistical Machine Learning with Transformation Rule Learning for Vietnamese Word Sense Disambiguation". In Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2012 IEEE RIVF International Conference on, pp. 62-67. IEEE, 2012.


Description of the algorithms in the model

Training process of the combined model