Conclusion of Chapter 4

Chapter 4 focused on proposing and improving plagiarism detection techniques for Vietnamese text. It comprises two new proposals and improvements to three techniques proposed in Chapters 2 and 3 of the thesis. The first proposal presents a method for building a Vietnamese corpus used to test and evaluate algorithms that detect plagiarized passages in Vietnamese text. The second proposal presents a keyword extraction technique based on TF-IDF weights that takes part of speech into account, applied to long Vietnamese documents. To provide a basis for adapting the English-language plagiarism detection techniques to Vietnamese, the thesis analyzed the influence of language-specific factors at each processing step and, from that analysis, proposed improvements to the keyword extraction technique for candidate retrieval and to the two techniques for detecting plagiarized passages.

The main contributions of this chapter are:

- A solution and procedure for building a Vietnamese corpus for plagiarized-passage detection, used to test and evaluate algorithms that detect plagiarized passages in Vietnamese text.

- A keyword extraction method for long Vietnamese documents, based on TF-IDF weights at the document level and the passage level combined with part-of-speech information.

- Improvements to the keyword extraction and plagiarized-passage detection techniques for Vietnamese text, based on an analysis of the influence of language-specific factors at each processing stage.
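The idea of scoring keywords with TF-IDF at two levels and then weighting by part of speech can be sketched as follows. This is a minimal illustration, not the method of the thesis: the part-of-speech weights in `POS_WEIGHT`, the mixing factor `alpha`, the smoothed IDF formula, and the choice to take the maximum score over passages are all assumptions made for the example, and the input is assumed to be already word-segmented and POS-tagged.

```python
import math
from collections import Counter

# Hypothetical part-of-speech weights (assumption): nouns matter most.
POS_WEIGHT = {"N": 1.0, "V": 0.6, "A": 0.4}
DEFAULT_POS_WEIGHT = 0.1


def tf_idf(units):
    """Return one {term: score} dict per unit; a unit is a list of terms.

    Uses relative term frequency and a smoothed IDF,
    log((1 + n_units) / (1 + df)) + 1, chosen here for illustration.
    """
    n_units = len(units)
    doc_freq = Counter(term for unit in units for term in set(unit))
    scores = []
    for unit in units:
        counts = Counter(unit)
        total = len(unit) or 1
        scores.append({
            term: (count / total)
            * (math.log((1 + n_units) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return scores


def extract_keywords(documents, doc_index, top_k=5, alpha=0.5):
    """Rank the terms of documents[doc_index].

    documents: list of documents; a document is a list of passages;
    a passage is a list of (word, pos_tag) pairs.
    """
    target = documents[doc_index]

    # Document-level TF-IDF: IDF is computed over the whole collection.
    doc_terms = [[w for passage in doc for w, _ in passage] for doc in documents]
    doc_scores = tf_idf(doc_terms)[doc_index]

    # Passage-level TF-IDF: IDF is computed over passages of the target document.
    passage_scores = tf_idf([[w for w, _ in passage] for passage in target])
    best_passage = {}
    for scores in passage_scores:
        for term, score in scores.items():
            best_passage[term] = max(best_passage.get(term, 0.0), score)

    # Use the first tag seen for each word as its part of speech.
    pos_of = {}
    for passage in target:
        for word, tag in passage:
            pos_of.setdefault(word, tag)

    combined = {
        term: (alpha * doc_scores[term] + (1 - alpha) * best_passage[term])
        * POS_WEIGHT.get(pos_of[term], DEFAULT_POS_WEIGHT)
        for term in doc_scores
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(combined, key=lambda t: (-combined[t], t))[:top_k]
```

Taking the maximum over passages lets a term that is concentrated in a single passage of a long document still surface as a keyword; averaging would be an equally plausible design choice.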

CONCLUSION

1. Research results of the thesis

Research on plagiarism detection techniques has attracted considerable attention from researchers in Vietnam and abroad, which is why the thesis pursued a research direction within this class of problems. Over the course of the study, it became clear that existing proposals for plagiarism detection still have limitations: approaches for handling plagiarism that involves modified text are not yet truly effective, and the application of plagiarism detection techniques to Vietnamese text remains limited. The research direction of the thesis is therefore necessary. The thesis achieved its objectives: it proposed techniques for the global plagiarism detection problem, built Vietnamese corpora, and adapted the proposed techniques and tested them on these corpora, helping to address the limitations noted above.

The results achieved by the thesis are:

- Studied the global plagiarism detection problem; analyzed and evaluated the strengths and weaknesses of research directions related to its two component problems: keyword extraction for candidate document retrieval, and plagiarized-passage detection.

- Proposed a keyword extraction method for candidate document retrieval and two methods for detecting plagiarized passages in English text. Ran experiments comparing and evaluating the effectiveness of the proposed methods against international approaches to each problem.

- Proposed a keyword extraction method for long Vietnamese documents. Adapted the techniques proposed for English text so that they apply to Vietnamese text.

- Proposed a solution and procedure for building a Vietnamese plagiarized-passage detection corpus, used to test and evaluate plagiarism detection algorithms for Vietnamese text.

articles, and the ĐATN corpus used for the Vietnamese keyword extraction problem.

2. New contributions of the thesis

- Proposed two keyword extraction techniques: one based on TF-IDF weights combined with part of speech, and one based on feature extraction and a feedforward neural network (FFNN) model.

- Proposed two techniques for detecting plagiarized passages: one based on the LDA algorithm combined with the Apriori frequent-itemset mining algorithm, and one using an LSTM deep neural network.

- Built a monolingual Vietnamese corpus for the text plagiarism detection problem.
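For readers unfamiliar with Apriori, the frequent-itemset step named above can be sketched in a few lines. This section does not describe how the thesis combines LDA output with Apriori, so the example below only illustrates the generic algorithm; treating each transaction as, for instance, the set of topic labels assigned to one sentence is a hypothetical usage, and `min_support` is an absolute count chosen for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets.

    transactions: iterable of item collections.
    min_support: minimum number of transactions that must contain an itemset.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for transaction in transactions:
        for item in transaction:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            # Prune step: every (k-1)-subset of a candidate must be frequent.
            if len(union) == k and all(
                frozenset(subset) in current
                for subset in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Count the support of each surviving candidate.
        counts = {
            candidate: sum(1 for t in transactions if candidate <= t)
            for candidate in candidates
        }
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent
```

The prune step is what makes Apriori practical: a candidate is only counted if all of its smaller subsets were already found frequent.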

3. Future research directions

In theory: continue developing keyword extraction techniques that achieve higher effectiveness. Continue studying plagiarized-passage detection techniques and semantic similarity measures, with a focus on solutions for Vietnamese text.

In practice: combine the keyword extraction and plagiarized-passage detection solutions, following both the word-matching and the semantic approaches, to build a complete plagiarism detection application for real-world use.

LIST OF PUBLISHED SCIENTIFIC WORKS

[CT1] Le, H. T., Pham, L. N., Nguyen, D. D., Nguyen, S. V., & Nguyen, A. N. (2016), "Semantic text alignment based on topic modeling," 2016 IEEE RIVF International Conference on Computing & Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), IEEE, 2016, pp. 67-72, DOI: 10.1109/rivf.2016.7800271

[CT2] , Lê Thanh Hương, Nguyễn Chí Thành (2018), "Phương pháp trích rút từ khóa tìm tập ứng cử trong bài toán phát hiện đạo văn" [A keyword extraction method for candidate retrieval in plagiarism detection], Tạp chí Nghiên cứu khoa học và Công nghệ quân sự, special issue 11/2018, pp. 27-35

[CT3] Nguyen Van Son, Le Thanh Huong, Nguyen Chi Thanh (2019), "Construction monolingual vietnamese corpus for plagiarism detection," Tạp chí Nghiên cứu khoa học và Công nghệ quân sự, special issue, October 2019, pp. 249-256

[CT4] Nguyen Van Son, Le Thanh Huong, Nguyen Chi Thanh (2020), "Automatic keyword extraction using artificial neural network and feature extraction," Tạp chí Nghiên cứu khoa học và Công nghệ quân sự, No. 69A, November 2020, pp. 63-74

[CT5] Nguyen Van Son, Le Thanh Huong, Nguyen Chi Thanh (2021), "A two-phase plagiarism detection system based on multi-layer LSTM Networks," IAES International Journal of Artificial Intelligence (IJ-AI) (Q2), Vol. 10, No. 3,


Introduction to recurrent neural networks (RNN)

Introduction to stacked LSTM networks