Doctoral thesis in Information Technology: Research on recognizing named entities and phenotype entities in text, and applications

VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

TRẦN MAI VŨ

RESEARCH ON RECOGNIZING NAMED ENTITIES AND PHENOTYPE ENTITIES IN TEXT, AND APPLICATIONS

Major: Information Systems
Code: 62.48.05.01

DOCTORAL THESIS IN INFORMATION TECHNOLOGY

SCIENTIFIC SUPERVISORS: Assoc. Prof. Dr. Hà Quang Thụy, Assoc. Prof. Dr. Nguyễn Lê Minh

Hanoi – 2018

DECLARATION

I declare that this is my own research work. Results co-written with other authors were included in the thesis with the consent of the co-authors. The results presented in the thesis are truthful and have not been published in any other work.

Author: Trần Mai Vũ

ACKNOWLEDGEMENTS

This thesis was carried out at the Department of Information Systems, Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi, under the scientific supervision of Assoc. Prof. Dr. Hà Quang Thụy and Assoc. Prof. Dr. Nguyễn Lê Minh.

First of all, I would like to express my deep gratitude to Assoc. Prof. Dr. Hà Quang Thụy and Assoc. Prof. Dr. Nguyễn Lê Minh, who introduced me to this research field. They taught and guided me with great dedication, helping me approach the research work and succeed in it. They always encouraged, motivated, and advised me wholeheartedly so that I could complete the thesis.

I would like to thank the lecturers of the Faculty of Information Technology and the staff of the Academic Affairs Office of the University of Engineering and Technology for providing favorable conditions and support during my study and research at the university.

I would like to thank Assoc. Prof. Dr. Nigel Collier and his colleagues for their valuable comments, which helped me improve the thesis.

The encouragement of my friends was an important source of motivation for completing the thesis.

I would like to express my deep gratitude to my family and my wife, whose steady support made today's achievement possible.

Author: Trần Mai Vũ

TABLE OF CONTENTS

DECLARATION
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF SYMBOLS AND ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES AND CHARTS
INTRODUCTION
Reasons for choosing the topic
Specific objectives and scope of the thesis
Structure of the thesis
Chapter 1 - OVERVIEW OF ENTITY RECOGNITION
1.1 Basic concepts
1.1.1 Definition of the entity recognition problem
1.1.2 Challenges
1.1.3 Evaluation measures
1.1.4 Applications of entity recognition
1.2 A brief history of research and some approaches to the problem
1.3 Entity recognition in Vietnamese text data and related work
1.3.1 Challenges in processing Vietnamese data
1.3.2 Research motivation
1.3.3 Related work
1.4 Entity recognition in English biomedical text data and related work
1.4.1 Challenges in processing biomedical data
1.4.2 Research motivation
1.4.3 Related work
1.5 Chapter summary
Chapter 2 - PERSON NAME ENTITY RECOGNITION COMBINED WITH NAMED ENTITY ATTRIBUTE RECOGNITION IN VIETNAMESE TEXT
2.1 Introduction
2.2 Related work
2.2.1 Related work worldwide
2.2.2 Related work in Vietnam
2.3 A model for person name entity recognition combined with entity attribute recognition
2.3.1 Maximum entropy model with beam search decoding (MEM+BS)
2.3.2 Conditional random fields (CRF)
2.3.3 Proposed model
2.3.4 Feature set
2.4 Experiments, results, and evaluation
2.4.1 Tools and evaluation data
2.4.2 Experimental results for the whole system
2.4.3 Experimental results per label
2.5 Applying the model to a Vietnamese person-name question answering system
2.5.1 Problem overview
2.5.2 Characteristics of questions about person name entities in Vietnamese
2.5.3 Proposed model
2.5.4 Evaluation method and data for the automatic question answering model
2.5.6 Experiments and evaluation
2.6 Chapter summary
Chapter 3 - PHENOTYPE ENTITY RECOGNITION IN ENGLISH BIOMEDICAL TEXT
3.1 Introduction
3.1.1 Motivation and overview of the phenotype entity recognition problem
3.1.2 Concepts related to phenotype entities and some related entities
3.1.3 The domain adaptation problem in biomedical entity recognition
3.2 A model for recognizing phenotype entities and some related entities
3.2.1 Theoretical background
3.2.2 Evaluation data and supporting resources
3.2.3 Proposed model
3.2.4 Feature set and feature evaluation
3.2.5 Evaluation method
3.3 Experiments
3.3.1 Experiment 1: evaluating the effectiveness of the proposed model with different machine learning techniques
3.3.2 Experiment 2: comparing the results of the proposed model with related studies
3.3.3 Experiment 3: evaluating the contribution of resources to entity recognition results
3.3.4 Experiment 4: applying the proposed model to biomedical entity recognition in the BioCreAtIvE V CDR Task
3.4 Domain adaptation in biomedical entity recognition
3.4.1 Experiments
3.4.2 Results and evaluation
3.5 Chapter summary
Chapter 4 - A MODEL FOR IMPROVING BIOMEDICAL ENTITY RECOGNITION PERFORMANCE BASED ON HYBRID TECHNIQUES AND LEARNING TO RANK
4.1 An improved model for recognizing phenotype entities and related entities
4.2 Proposed hybrid methods
4.2.1 Rule-based hybrid method
4.2.2 Hybrid method using sequence-labeling machine learning
4.2.3 Hybrid method using learning to rank
4.3 Experiments and evaluation of results
4.3.1 Evaluation method
4.3.2 Experiment evaluating the effectiveness of the hybrid methods
4.3.3 Experiment testing reliability when evaluating the effectiveness of resources
4.3.4 Discussion and error analysis
4.4 Chapter conclusion
CONCLUSION
LIST OF THE AUTHOR'S PUBLICATIONS RELATED TO THE THESIS
REFERENCES

LIST OF SYMBOLS AND ABBREVIATIONS

Each entry gives the abbreviation, the English term, and the Vietnamese term used in the thesis.

NER: Named Entity Recognition (Nhận dạng thực thể định danh)
NLP: Natural Language Processing (Xử lý ngôn ngữ tự nhiên)
BioNLP: Biomedical Natural Language Processing (Xử lý ngôn ngữ tự nhiên cho dữ liệu y sinh)
IE: Information Extraction (Trích xuất thông tin)
CRF: Conditional Random Fields (Trường ngẫu nhiên có điều kiện)
SVM: Support Vector Machine (Máy véctơ hỗ trợ)
SVM-LTR: SVM-Learn to rank (Học xếp hạng máy véctơ hỗ trợ)
ME Model, Maxent Model: Maximum Entropy Model (Mô hình Entropy cực đại)
MEM+BS: Maximum Entropy with Beam Search Model (Mô hình Entropy cực đại với giải mã tìm kiếm chùm)

LIST OF TABLES

Table 2.1 An example of extracting a person name entity and its related attributes
Table 2.2 Labels used in the model
Table 2.3 Feature set used
Table 2.4 Statistics of entities in the labeled data set
Table 2.5 Whole-system evaluation results of the two models with the two methods MEM+BS and CRF
Table 2.6 Experimental results per label
Table 2.7 Examples of some question components
Table 2.8 Components appearing in questions about person name entities
Table 2.9 Example of general labeling for Vietnamese questions about person name entities
Table 2.10 Statistics of the evaluation question data set
Table 2.11 Evaluation results of the question analysis component
Table 2.12 Evaluation results of the automatic answering system
Table 3.1 List of autoimmune diseases used to build the Phenominer A data set
Table 3.2 Characteristics of the Phenominer A data set (autoimmune diseases) and the Phenominer B data set (cardiovascular diseases)
Table 3.3 Features used in the experiments
Table 3.4 Experiment comparing different machine learning methods
Table 3.5 Experiment comparing the proposed model with other systems
Table 3.6 Results of evaluating resources in the entity recognition model
Table 3.7 Statistics of the three data sets of the CDR task [WPL15]
Table 3.8 Results of the recognition model on the test data set
Table 3.9 F1 results of the NER systems using the methods in experiments 1-6
Table 4.1 Features used by MEM + BS to decide the result
Table 4.2 Results of the model on the Phenominer A data set using different methods to combine results

LIST OF THE AUTHOR'S PUBLICATIONS RELATED TO THE THESIS

[CTLA1] Nigel Collier, Ferdinand Paster, Mai-Vu Tran (2014). The impact of near domain transfer on biomedical named entity recognitions. LOUHI 2014, EACL 2014, Sweden, 2014.
[CTLA2] Nigel Collier, Mai-Vu Tran, Hoang-Quynh Le, Quang-Thuy Ha, Anika Oellrich, Dietrich Rebholz-Schuhmann (2013). Learning to Recognize Phenotype Candidates in the Auto-Immune Literature Using SVM Re-Ranking. PLoS ONE 8(10): e72965, October 2013.
[CTLA3] Mai-Vu Tran, Duc-Trong Le (2013). vTools: Chunker and Part-of-Speech tools. RIVF-VLSP 2013 Workshop.
[CTLA4] Nigel Collier, Mai-Vu Tran, Hoang-Quynh Le, Anika Oellrich, Ai Kawazoe, Martin Hall-May, Dietrich Rebholz-Schuhmann (2012). A Hybrid Approach to Finding
Phenotype Candidates in Genetic Texts. COLING 2012: 647-662.
[CTLA5] Mai-Vu Tran, Duc-Trong Le, Xuan-Tu Tran and Tien-Tung Nguyen (2012). A Model of Vietnamese Person Named Entity Question Answering System. PACLIC 2012, Bali, Indonesia, October 2012.
[CTLA6] Hoang-Quynh Le, Mai-Vu Tran, Nhat-Nam Bui, Nguyen-Cuong Phan, Quang-Thuy Ha (2011). An Integrated Approach Using Conditional Random Fields for Named Entity Recognition and Person Property Extraction in Vietnamese Text. IALP 2011: 115-118.
[CTLA7] Hoang-Quynh Le, Mai-Vu Tran, Thanh Hai Dang, Nigel Collier (2015). The UET-CAM System in the BioCreAtIvE V CDR Task. In Proceedings of the fifth BioCreative challenge evaluation workshop, Sevilla, Spain, 2015.

REFERENCES

Vietnamese

[DH96] Diệp Quang Ban (chief editor), Hoàng Văn Thung (1996). Vietnamese Grammar, Vols. 1 and 2 (in Vietnamese). Education Publishing House, Hanoi.
[NTH11] Nguyễn Thanh Hiên (2011). Named entity disambiguation based on closed and open ontologies (in Vietnamese). Doctoral thesis, University of Technology, Vietnam National University Ho Chi Minh City.
[SC13] Sam Chanrathany (2013). Extraction of named entities and entity relations in Vietnamese text (in Vietnamese). Doctoral thesis, Hanoi University of Science and Technology.

English

[AHB93] Appelt, D. E., Hobbs, J. R., Bear, J., Israel, D., & Tyson, M. (1993, August). FASTUS: A finite-state processor for information extraction from real-world text. In IJCAI (Vol. 93, pp. 1172-1178).
[AZ05] Ando, R. K., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. The Journal of Machine Learning Research, 6, 1817-1853.
[AZ11b] A. B. Abacha and P. Zweigenbaum. Medical entity recognition: A comparison of semantic and statistical methods. In Proceedings of BioNLP 2011 Workshop, pages 56-64, 2011.
[AZ12] Aggarwal, C. C., & Zhai, C. (2012). Mining text data. Springer Science & Business Media.
[BBD02] Banko, M., Brill, E., Dumais, S., & Lin, J. (2002, March). AskMSR: Question answering using the worldwide Web. In Proceedings of 2002 AAAI Spring Symposium on Mining Answers from Texts and Knowledge Bases (pp. 7-9).
[BPP96] Berger, A. L., Pietra, V. J. D., & Pietra, S. A. D. (1996). A maximum entropy approach to natural language processing. Computational linguistics, 22(1), 39-71.
[BR04] Bard, J. B., & Rhee, S. Y. (2004). Ontologies in biology: design, applications and future challenges. Nature Reviews Genetics, 5(3), 213-222.
[BSS03] Blake, A., Sinclair, M. T., & Sugiyarto, G. (2003). Quantifying the impact of foot and mouth disease on tourism and the UK economy. Tourism Economics, 9(4), 449-465.
[BSS08] Beisswanger, E., Schulz, S., Stenzhorn, H., & Hahn, U. (2008). BioTop: An upper domain ontology for the life sciences: A description of its current structure, contents and interfaces to OBO ontologies. Applied Ontology, 3(4), 205-212.
[CC03] Curran, J. R., & Clark, S. (2003, May). Language independent NER using a maximum entropy tagger. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003, Volume 4 (pp. 164-167). Association for Computational Linguistics.
[CC09] Cai, Y., & Cheng, X. (2009, October). Biomedical named entity recognition with tri-training learning. In Biomedical Engineering and Informatics, 2009. BMEI'09. 2nd International Conference on (pp. 1-5). IEEE.
[COG15] Collier, N., Oellrich, A., & Groza, T. (2015). Concept selection for phenotypes and diseases using learn to rank. Journal of biomedical semantics, 6(1), 24.
[CF04] Chen, L., & Friedman, C. (2004). Extracting phenotypic information from the literature via natural language processing. Medinfo, 11(Pt 2), 758-62.
[CGE11] Cohen, R., Gefen, A., Elhadad, M., & Birk, O. S. (2011). CSI-OMIM - Clinical Synopsis Search in OMIM. BMC bioinformatics, 12(1), 65.
[COG13] Collier, N., Oellrich, A., & Groza, T. (2013). Toward knowledge support for analysis and interpretation of complex traits. Genome biology, 14(9), 214.
[CTX06] Cam-Tu Nguyen, Trung Kien Nguyen, Xuan Hieu Phan, Le Minh Nguyen, and Quang Thuy Ha. Vietnamese Word Segmentation with CRFs and SVMs: An Investigation. The 20th Pacific Asia Conference on Language, Information, and Computation (PACLIC), 1st-3rd November, 2006, Wuhan, China.
[CH08] Cohen, K. B., & Hunter, L. (2008). Getting started in text mining. PLoS computational biology, 4(1), e20.
[DA07] H. Daume III. 2007. Frustratingly easy domain adaptation. In Annual meeting of the Association for Computational Linguistics (ACL 2007), pages 256-263.
[DCX12] Doan, S., Collier, N., Xu, H., Duy, P. H., & Phuong, T. M. (2012). Recognition of medication information from discharge summaries using ensembles of classifiers. BMC medical informatics and decision making, 12(1), 36.
[DDS09] Nguyen, D. Q., Nguyen, D. Q., & Pham, S. B. (2009, October). A Vietnamese question answering system. In Knowledge and Systems Engineering, 2009. KSE'09. International Conference on (pp. 26-32). IEEE.
[DMP04] Doddington, G. R., Mitchell, A., Przybocki, M. A., Ramshaw, L. A., Strassel, S., & Weischedel, R. M. (2004, May). The Automatic Content Extraction (ACE) Program - Tasks, Data, and Evaluation. In LREC.
[ES13] Ekbal, A., & Saha, S. (2013). Stacked ensemble coupled with feature selection for biomedical entity extraction. Knowledge-Based Systems, 46, 22-32.
[EUL01] Eduard Hovy, Ulf Hermjakob and Lin, C.-Y. The Use of External Knowledge in Factoid QA. Paper presented at the Tenth Text REtrieval Conference (TREC 10), Gaithersburg, MD, 2001, November 13-16.
[FEO02] K. Franzén, G. Eriksson, F. Olsson, L. Asker, P. Lidén, and J. Coster. Protein names and how to find them. International Journal of Medical Informatics, 67(1-3):49-61, 2002.
[FIJ03] Florian, R., Ittycheriah, A., Jing, H. and Zhang, T. (2003). Named Entity Recognition through Classifier Combination. Proceedings of CoNLL-2003. Edmonton, Canada.
[FPS96] Fayyad, Piatetsky-Shapiro, Smyth. From Data Mining to Knowledge Discovery: An Overview. In Fayyad, Piatetsky-Shapiro, Smyth, Uthurusamy, Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press, Menlo Park, 1996, 1-34.
[FS03] Freimer, N., & Sabatti, C. (2003). The human phenome project. Nature genetics, 34(1), 15-21.
[FTT98] Fukuda, K. I., Tsunoda, T., Tamura, A., & Takagi, T. (1998, January). Toward information extraction: identifying protein names from biological papers. In Pac Symp Biocomput (Vol. 707, No. 18, pp. 707-718).
[GCS11] Gremse, M., Chang, A., Schomburg, I., Grote, A., Scheer, M., Ebeling, C., & Schomburg, D. (2011). The BRENDA Tissue Ontology (BTO): the first all-integrating ontology of all organisms for enzyme sources. Nucleic acids research, 39(suppl 1), D507-D513.
[GFH08] Danilo Giampiccolo, Pamela Forner, Jesús Herrera, Anselmo Peñas, Christelle Ayache, Corina Forascu, Valentin Jijkoun, Petya Osenova, Paulo Rocha, Bogdan Sacaleanu, Richard F. E. Sutcliffe (2008). Overview of the CLEF 2007 multilingual question answering track. In Advances in Multilingual and Multimodal Information Retrieval (pp. 200-236). Springer Berlin Heidelberg.
[GKD15] Groza, T., Köhler, S., Doelken, S., Collier, N., Oellrich, A., Smedley, D., & Robinson, P. N. (2015). Automatic concept recognition using the Human Phenotype Ontology reference and test suite corpora. Database, 2015.
[GHZ12] Groza, T., Hunter, J., & Zankl, A. (2012). Supervised segmentation of phenotype descriptions for the human skeletal phenome using hybrid methods. BMC bioinformatics, 13(1), 265.
[GHZ13] Groza, T., Hunter, J., & Zankl, A. (2013). Decomposing phenotype descriptions for the human skeletal phenome. Biomedical informatics insights, 6.
[GLR06] Giuliano, C., Lavelli, A., & Romano, L. (2006, April). Exploiting shallow linguistic information for relation extraction from biomedical literature. In EACL (Vol. 18, pp. 401-408).
[GNB10] Gerner, M., Nenadic, G., & Bergman, C. M. (2010). LINNAEUS: a species name identification system for biomedical literature. BMC bioinformatics, 11(1), 85.
[GR08] Girju, R. Semantic relation extraction and its applications. ESSLLI 2008 Course Material, Hamburg, Germany, 4-15 August 2008.
[GZH12] Groza, T., Zankl, A., & Hunter, J. (2012). Experiences with modeling composite phenotypes in the SKELETOME project. In The Semantic Web - ISWC 2012 (pp. 82-97). Springer Berlin Heidelberg.
[HBK12] Hirschman, L., Burns, G. A. C., Krallinger, M., Arighi, C., Cohen, K. B., Valencia, A., & Winter, A. G. (2012). Text mining for the biocuration workflow. Database, 2012, bas020.
[HC03] W.-J. Hou and H.-H. Chen. Enhancing performance of protein name recognizers using collocation. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, Volume 13, pages 25-32, 2003.
[HEG00] Hovy, Eduard and Gerber, Laurie and Hermjakob, Ulf and Junk, Michael and Lin, Chin-yew (2000). Question answering in Webclopedia. In Proceedings of the Ninth Text REtrieval Conference (TREC-9).
[HHH12] Hoehndorf, R., Harris, M. A., Herre, H., Rustici, G., & Gkoutos, G. V. (2012). Semantic integration of physiology phenotypes with an application to the Cellular Phenotype Ontology. Bioinformatics, 28(13), 1783-1789.
[HL15] Huang, C. C., & Lu, Z. (2015). Community challenges in biomedical text mining over 10 years: success, failure and the future. Briefings in bioinformatics, bbv024.
[HOR10] Hoehndorf, R., Oellrich, A., & Rebholz-Schuhmann, D. (2010). Interoperability between phenotype and anatomy ontologies. Bioinformatics, 26(24), 3112-3118.
[HSG11] Hoehndorf, R., Schofield, P. N., & Gkoutos, G. V. (2011). PhenomeNET: a whole-phenome approach to disease gene discovery. Nucleic acids research, 39(18), e119-e119.
[HSS09] Hettne, K. M., Stierum, R. H., Schuemie, M. J., Hendriksen, P. J., Schijvenaars, B. J., Van Mulligen, E. M., & Kors, J. A. (2009). A dictionary to identify small molecules and drugs in free text. Bioinformatics, 25(22), 2983-2991.
[HWY05] Huang, J., Wang, C., Yang, C., Chiu, M. and Yee, G. 2005. Applying Word Sense Disambiguation to Question Answering System for E-Learning. In Proceedings of the 19th International Conference on Advanced Information Networking and Applications. Taipei, Taiwan, pp. 157-62.
[JAJ10] Javier Artiles, Andrew Borthwick, Julio Gonzalo, Satoshi Sekine, and Enrique Amigó. WePS-3 Evaluation Campaign: Overview of the Web People Search Clustering and Attribute Extraction Tasks. In the 3rd Web People Search Evaluation Workshop (WePS 2010).
[Kai08] Kaisser, M. (2008, June). The QuALiM question answering demo: Supplementing answers with paragraphs drawn from Wikipedia. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Demo Session (pp. 32-35). Association for Computational Linguistics.
[KCO05] S. Kinoshita, K. B. Cohen, P. Ogren, and L. Hunter. BioCreAtIvE task 1A: Entity identification with a stochastic tagger. BMC Bioinformatics, 6(Suppl 1):S4, 2005.
[KLR15] Krallinger, M., Leitner, F., Rabal, O., Vazquez, M., Oyarzabal, J., & Valencia, A. (2015). CHEMDNER: The drugs and chemical names extraction challenge. J Cheminform, 7(Suppl 1), S1.
[KM14] Khordad, Maryam (2014). Investigating Genotype-Phenotype relationship extraction from biomedical text. Doctoral dissertation, University of Western Ontario.
[KMR11] Khordad, M., Mercer, R. E., & Rogan, P. (2011). Improving phenotype name recognition. In Advances in Artificial Intelligence (pp. 246-257). Springer Berlin Heidelberg.
[KOT03] Kim, J. D., Ohta, T., Tateisi, Y., & Tsujii, J. I. (2003). GENIA corpus - a semantically annotated corpus for bio-textmining. Bioinformatics, 19(suppl 1), i180-i182.
[KOT04] Kim, J. D., Ohta, T., Tsuruoka, Y., Tateisi, Y., & Collier, N. (2004, August). Introduction to the bio-entity recognition task at JNLPBA. In Proceedings of the international joint workshop on natural language processing in biomedicine and its applications (pp. 70-75). Association for Computational Linguistics.
[LDN13] Le, N. M., Do, B. N., Nguyen, V. D., & Nguyen, T. D. (2013, December). VNLP: an open source framework for Vietnamese natural language processing. In Proceedings of the Fourth Symposium on Information and Communication Technology (pp. 88-93). ACM.
[LLL14] Le Trung, H., Le Anh, V., & Le Trung, K. (2014). Bootstrapping and Rule-Based Model for Recognizing Vietnamese Named Entity. In Intelligent Information and Database Systems (pp. 167-176). Springer International Publishing.
[LMP01] Lafferty, J., McCallum, A., & Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
[LN10] Le, H. T., & Nguyen, T. H. (2010, August). Name entity recognition using inductive logic programming. In Proceedings of the 2010 Symposium on Information and Communication Technology (pp. 71-77). ACM.
[LTC04] Lin, Y. F., Tsai, T. H., Chou, W. C., Wu, K. P., Sung, T. Y., & Hsu, W. L. (2004, August). A maximum entropy approach to biomedical named entity recognition. In BIOKDD (pp. 56-61).
[LV13] Le, H. T., & Van Tran, L. (2013, December). Automatic feature selection for named entity recognition using genetic algorithm. In Proceedings of the Fourth Symposium on Information and Communication Technology (pp. 81-87). ACM.
[MAC07] Mabee, P. M., Ashburner, M., Cronk, Q., Gkoutos, G. V., Haendel, M., Segerdell, E., & Westerfield, M. (2007). Phenotype ontologies: the bridge between genomics and evolution. Trends in ecology & evolution, 22(7), 345-350.
[MC07] McKusick, V. A. (2007). Mendelian Inheritance in Man and its online version, OMIM. American journal of human genetics, 80(4), 588.
[MFM05] Mitsumori, T., Fation, S., Murata, M., Doi, K., & Doi, H. (2005). Gene/protein name recognition based on support vector machine using dictionary as features. BMC bioinformatics, 6(Suppl 1), S8.
[MFP00] McCallum, A., Freitag, D., & Pereira, F. C. (2000, June). Maximum Entropy Markov Models for Information Extraction and Segmentation. In ICML (pp. 591-598).
[MHC04] A. A. Morgan, L. Hirschman, M. Colosimo, A. S. Yeh, and J. B. Colombe. Gene name identification and normalization using a model organism database. Journal of Biomedical Informatics, 37(6):396-410, 2004.
[ML03] McCallum, A., & Li, W. (2003, May). Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003, Volume 4 (pp. 188-191). Association for Computational Linguistics.
[MO08] Michele Banko, Oren Etzioni. The Tradeoffs Between Open and Traditional Relation Extraction. ACL 2008: 28-36.
[MPH03] Moldovan, D., Paşca, M., Harabagiu, S., & Surdeanu, M. (2003). Performance issues and error analysis in an open-domain question answering system. ACM Transactions on Information Systems (TOIS), 21(2), 133-154.
[MR04] Mika, S., & Rost, B. (2004). Protein names precisely peeled off free text. Bioinformatics, 20(suppl 1), i241-i247.
[MY14] Miwa, Makoto, and Yutaka Sasaki. Modeling Joint Entity and Relation Extraction with Table Representation. EMNLP 2014.
[NBK13] Nédellec, C., Bossy, R., Kim, J. D., Kim, J. J., Ohta, T., Pyysalo, S., & Zweigenbaum, P. (2013, August). Overview of BioNLP shared task 2013. In Proceedings of the BioNLP Shared Task 2013 Workshop (pp. 1-7).
[NC12] Nguyen, T. T., & Cao, T. H. (2012, February). Linguistically Motivated and Ontological Features for Vietnamese Named Entity Recognition. In Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2012 IEEE RIVF International Conference on (pp. 1-6). IEEE.
[NCT99] C. Nobata, N. Collier, and J.-i. Tsujii. Automatic term identification and classification in biology texts. In Proceedings of the Natural Language Pacific Rim Symposium, pages 369-374, 1999.
[NE05] Nédellec, C. (2005, August). Learning language in logic-genic interaction extraction challenge. In Proceedings of the 4th Learning Language in Logic Workshop (LLL05) (Vol. 7).
[NN13] Nguyen, M. T., & Nguyen, T. T. (2013, December). Extraction of disease events for a real-time monitoring system. In Proceedings of the Fourth Symposium on Information and Communication Technology (pp. 139-147). ACM.
[NP12] Nguyen, D. B., & Pham, S. B. (2012). Ripple down rules for Vietnamese named entity recognition. In Computational Collective Intelligence. Technologies and Applications (pp. 354-363). Springer Berlin Heidelberg.
[NRV03] M. Narayanaswamy, K. E. Ravikumar, and K. Vijay-Shanker. A biological named entity recognizer. In Pacific Symposium on Biocomputing, pages 427-438, 2003.
[NHP10] Nguyen, D. B., Hoang, S. H., Pham, S. B., & Nguyen, T. P. (2010). Named entity recognition for Vietnamese. In Intelligent Information and Database Systems (pp. 205-214). Springer Berlin Heidelberg.
[OCQ09] Oanh Thi Tran, Cuong Anh Le, Quang-Thuy Ha and Quynh Hoang Le. An Experimental Study on Vietnamese POS tagging. International Conference on Asian Language Processing (IALP 2009): 23-27, Dec 7-9, 2009, Singapore.
[OMT06] D. Okanohara, Y. Miyao, Y. Tsuruoka, and J. Tsujii. Improving the scalability of semi-Markov conditional random fields for named entity recognition. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 465-472, 2006.
[OOG05] Özgür, A., Özgür, L., & Güngör, T. (2005). Text categorization with class-based and corpus-based keyword selection. In Computer and Information Sciences-ISCIS 2005 (pp. 606-615). Springer Berlin Heidelberg.
[PGH07] Pyysalo, S., Ginter, F., Heimonen, J., Björne, J., Boberg, J., Järvinen, J., & Salakoski, T. (2007). BioInfer: a corpus for information extraction in the biomedical domain. BMC bioinformatics, 8(1), 50.
[PNH10] Phan, T. T., Nguyen, T. C., & Huynh, T. N. (2010). Question semantic analysis in Vietnamese QA system. In Advances in Intelligent Information and Database Systems (pp. 29-40). Springer Berlin Heidelberg.
[PY10] Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. Knowledge and Data Engineering, IEEE Transactions on, 22(10), 1345-1359.
[QU93] Quinlan, J. R. (1993). C4.5: programs for machine learning (Vol. 1). Morgan Kaufmann.
[RA89] Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.
[RA91] Rau, L. F. (1991, February). Extracting company names from text. In Artificial Intelligence Applications, 1991. Proceedings., Seventh IEEE Conference on (Vol. 1, pp. 29-32). IEEE.
[RA96] Ratnaparkhi, A. (1996, May). A maximum entropy model for part-of-speech tagging. In Proceedings of the conference on empirical methods in natural language processing (Vol. 1, pp. 133-142).
[RHT10] Rathany Chan Sam, Huong Thanh Le, Thuy Thanh Nguyen, The Minh Trinh. Relation Extraction in Vietnamese Text Using Conditional Random Fields. AAIRS 2010: 330-339.
[RM95] L. A. Ramshaw and M. P. Marcus. Text chunking using transformation-based learning. In 3rd ACL SIGDAT Workshop on Very Large Corpora, pages 82-94, 1995.
[RR09] Ratinov, L., & Roth, D. (2009). Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (pp. 147-155). Association for Computational Linguistics.
[SCW09] Scheuermann, R. H., Ceusters, W., & Smith, B. (2009). Toward an ontological treatment of disease and diagnosis. Summit on translational bioinformatics, 2009, 116.
[SE04] Settles, B. (2004, August). Biomedical named entity recognition using conditional random fields and rich feature sets. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (pp. 104-107). Association for Computational Linguistics.
[SE09] Smith, C. L., & Eppig, J. T. (2009). The mammalian phenotype ontology: enabling robust annotation and comparative analysis. Wiley Interdisciplinary Reviews: Systems Biology and Medicine, 1(3), 390-399.
[SGE04] Smith, C. L., Goldsmith, C. A. W., & Eppig, J. T. (2004). The Mammalian Phenotype Ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome biology, 6(1), R7.
[SJ09] Satoshi Sekine and Javier Artiles. WePS2 Attribute Extraction Task. In the 2nd Web People Search Evaluation Workshop (WePS 2, 2009).
[SLT11a] Sam, R. C., Le, H. T., Nguyen, T. T., & Nguyen, T. H. (2011). Combining proper name-coreference with conditional random fields for semi-supervised named entity recognition in Vietnamese text. In Advances in Knowledge Discovery and Data Mining (pp. 512-524). Springer Berlin Heidelberg.
[SLT11b] Sam, R. C., Le, H. T., Nguyen, T. T., Le, D. A., & Nguyen, N. M. T. (2011, October). Semi-supervised learning for relation extraction in Vietnamese text. In Proceedings of the Second Symposium on Information and Communication Technology (pp. 100-105). ACM.
[SMY15] Sun, H., Ma, H., Yih, W. T., Tsai, C. T., Liu, J., & Chang, M. W. (2015, May). Open Domain Question Answering via Semantic Enrichment. In Proceedings of the 24th International Conference on World Wide Web (pp. 1045-1055). International World Wide Web Conferences Steering Committee.
[SOK13] Smedley, D., Oellrich, A., Köhler, S., Ruef, B., Westerfield, M., Robinson, P., & Mungall, C. (2013). PhenoDigm: analyzing curated annotations to associate animal models with human diseases. Database, 2013, bat025.
[SSM09] S. K. Saha, S. Sarkar, and P. Mitra. Feature selection techniques for maximum entropy based biomedical named entity recognition. Journal of Biomedical Informatics, vol. 42, no. 5, pp. 905-911, 2009.
[STM08] Y. Sasaki, Y. Tsuruoka, J. McNaught, and S. Ananiadou. How to make the most of NE dictionaries in statistical NER. BMC Bioinformatics, 9(Suppl 11):S5, 2008.
[TC05] K. Takeuchi and N. Collier. Bio-medical entity extraction using support vector machines. Artificial Intelligence in Medicine, 33(2):125-137, 2005.
[TLH10] Tran Thi Oanh, Le Cuong Anh, Ha Thuy Quang. Improving Vietnamese Word Segmentation and POS Tagging using MEM with Various Kinds of Resources. Journal of Natural Language Processing, 17(3): 41-60 (2010).
[TOH05] Tu, N. C., Oanh, T. T., Hieu, P. X., & Thuy, H. Q. (2005). Named entity recognition in Vietnamese free-text and web documents using conditional random fields. In The 8th Conference on Some selection problems of Information Technology and Telecommunication.
[TTD07] Thao, P. T. X., Tri, T. Q., Dien, D., & Collier, N. (2007). Named entity recognition in Vietnamese using classifier voting. ACM Transactions on Asian Language Information Processing (TALIP), 6(4).
[TTK05] Tsuruoka, Y., Tateishi, Y., Kim, J. D., Ohta, T., McNaught, J., Ananiadou, S., & Tsujii, J. I. (2005). Developing a robust part-of-speech tagger for biomedical text. In Advances in informatics (pp. 382-392). Springer Berlin Heidelberg.
[TTQ07] Tran, Q. T., Pham, T. T., Ngo, Q. H., Dinh, D., & Collier, N. (2007). Named entity recognition in Vietnamese documents. Progress in Informatics Journal, 5, 14-17.
[TWC06] Tzong-Han Tsai, Richard; Wu, S.-H.; Chou, W.-C.; Lin, Y.-C.; He, D.; Hsiang, J.; Sung, T.-Y.; Hsu, W.-L. 2006. Various Criteria in the Evaluation of Biomedical Named Entity Recognition. BMC Bioinformatics 7:92, BioMed Central.
[UCO11] Y. Usami, H.-C. Cho, N. Okazaki, and J. Tsujii. Automatic acquisition of huge training data for bio-medical named entity recognition. In Proceedings of BioNLP 2011 Workshop, pages 65-73, 2011.
[USC10] Uzuner, Ö., Solti, I., & Cadag, E. (2010). Extracting medication information from clinical text. Journal of the American Medical Informatics Association, 17(5), 514-518.
[USS10] Uzuner, Ö., South, B. R., Shen, S., & DuVall, S. L. (2011). 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association.
[VA10] Vlachos, A. (2010). Semi-supervised learning for biomedical information extraction. Doctoral dissertation, Computer Laboratory, University of Cambridge.
[VED01] Voorhees, Ellen M., and Donna Harman. Overview of TREC 2001. TREC 2001.
[Vo03] E. M. Voorhees. Overview of the TREC 2003 Question Answering Track. TREC 2003: 54-68.
[VVO09] Vu Mai Tran, Vinh Duc Nguyen, Oanh Thi Tran, Uyen Thu Thi Pham, Thuy Quang Ha. An Experimental Study of Vietnamese Question Answering System. In Proceedings of IALP'2009, pp. 152-155.
[WAC12] Wu, C. H., Arighi, C. N., Cohen, K. B., Hirschman, L., Krallinger, M., Lu, Z., & Wilbur, W. J. (2012). BioCreative-2012 Virtual Issue. Database: The Journal of Biological Databases and Curation, 2012.
[WGM14] West, R., Gabrilovich, E., Murphy, K., Sun, S., Gupta, R., & Lin, D. (2014, April). Knowledge base completion via search-based question answering. In Proceedings of the 23rd international conference on World wide web (pp. 515-526). ACM.
[WKS09] Wang, Y., Kim, J. D., Sætre, R., Pyysalo, S., & Tsujii, J. I. (2009). Investigating heterogeneous protein annotations toward cross-corpora utilization. BMC bioinformatics, 10(1), 403.
[WPL15] Wei, C. H., Peng, Y., Leaman, R., Davis, A. P., Mattingly, C. J., Li, J., & Lu, Z. (2015). Overview of the BioCreative V chemical disease relation (CDR) task. In Proceedings of the fifth BioCreative challenge evaluation workshop, Sevilla, Spain.
[WTJ13] Wagholikar, K. B., Torii, M., Jonnalagadda, S., & Liu, H. (2013). Pooling annotated corpora for clinical concept extraction. J. Biomedical Semantics, 4.
[YD14] Yao, X., & Van Durme, B. (2014). Information extraction over structured data: Question answering with Freebase. In Proceedings of ACL.
[YYW15] Yang, Y., Yih, W. T., & Meek, C. (2015). WIKIQA: A Challenge Dataset for Open-Domain Question Answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
[ZD09] Zweigenbaum, P., & Demner-Fushman, D. (2009). Advanced literature-mining tools. In Bioinformatics (pp. 347-380). Springer New York.
[ZDY07] Zweigenbaum, P., Demner-Fushman, D., Yu, H., & Cohen, K. B. (2007). Frontiers of biomedical text mining: current progress. Briefings in bioinformatics, 8(5), 358-375.
[ZSZ05] G. Zhou, D. Shen, J. Zhang, J. Su, and S. Tan. Recognition of protein/gene names from text using an ensemble of classifiers. BMC Bioinformatics, 6(Suppl 1):S7, 2005.

... the problem of entity recognition in Vietnamese text ... the two research objectives of the thesis. 1.3.3 Related work. Entity recognition for Vietnamese has received considerable attention from the research community in Vietnam ... researchers ...
... research potential, and reviews several prominent applications of entity recognition. 1.1 Basic concepts. 1.1.1 Definition of the entity recognition problem. The entity recognition problem (also called the named entity recognition problem; ...
