Doctoral dissertation: Text stream mining with clustering techniques

MINISTRY OF EDUCATION AND TRAINING
LẠC HỒNG UNIVERSITY

VÕ THỊ HỒNG THẮM

TEXT STREAM MINING WITH CLUSTERING TECHNIQUES

DOCTORAL DISSERTATION IN COMPUTER SCIENCE

Major: Computer Science
Code: 9480101
SCIENTIFIC SUPERVISOR: Assoc. Prof. Dr. ĐỖ PHÚC

Đồng Nai, 2021

ACKNOWLEDGMENTS

Sincere thanks to Assoc. Prof. Dr. Đỗ Phúc for the dedicated guidance that helped the doctoral candidate complete this dissertation.
Sincere thanks to the lecturers of the Postgraduate Faculty, Lạc Hồng University, for creating favorable conditions and supporting the candidate in completing the dissertation.
Respectful thanks to Thủ Dầu Một University for supporting the candidate's studies at Lạc Hồng University.
Sincere thanks to friends and colleagues for helping the candidate complete the dissertation.

Doctoral candidate: Võ Thị Hồng Thắm

DECLARATION

I declare that this dissertation is my own research, carried out under the guidance of Assoc. Prof. Dr. Đỗ Phúc. The data and materials in the study are truthful and have not been published in any other research work. All referenced and inherited material is fully cited.

Đồng Nai, date … 2021
Doctoral candidate: Võ Thị Hồng Thắm

TABLE OF CONTENTS

CHAPTER 1: INTRODUCTION
1.1 Overview of the dissertation topic
1.1.1 Research problem and its significance
1.1.2 Challenges of text stream clustering
1.1.3 Research issues
1.1.4 Research problems
1.2 Contributions of the dissertation and published works
1.3 Research objectives, scope, and methods
1.3.1 Research objectives
1.3.2 Research scope
1.3.3 Research methods
1.4 Structure of the dissertation
1.5 Chapter summary
CHAPTER 2: RELATED WORK
2.1 Comparison of approaches related to text stream clustering
2.1.1 Approaches based on traditional topic models
2.1.2 Approaches based on dynamic mixture models
2.1.3 Approaches based on vector space representation
2.1.4 Topic modeling
2.1.5 Dirichlet process mixture model (DPMM)
2.1.6 Frequent subgraphs
2.1.7 Kleinberg's model of bursts in text streams
2.2 Chapter summary
CHAPTER 3: SEMANTIC TEXT STREAM CLUSTERING BASED ON GRAPH OF WORDS
3.1 Method
3.1.1 Representing text features with the bag-of-words (BOW) method
3.1.2 Representing text with graph of words (GOW)
3.1.3 Text stream clustering based on a mixture model
3.2 Experiments and discussion
3.3 Chapter summary
CHAPTER 4: DETECTING TRENDING PHRASES IN TEXT STREAMS
4.1 Method
4.2 Experiments and discussion
4.3 Chapter summary
CHAPTER 5: CONCLUSION & FUTURE WORK
5.1 Results achieved, limitations, and future work
5.2 Academic and practical significance of the dissertation

ENGLISH-VIETNAMESE GLOSSARY

English (abbreviation): Vietnamese
Latent Dirichlet Allocation (LDA): Phân bổ tiềm ẩn Dirichlet
Bag of Words (BOW): Túi từ
Benchmark: Đối sánh
Cluster validation: Xác nhận cụm
Common sub GOWs: Đồ thị con phổ biến
Concept/topic drift: Dòng trôi khái niệm/chủ đề
Corpus: Kho ngữ liệu
Density-based: Dựa trên mật độ
Dirichlet Process (DP): Quy trình Dirichlet
Dirichlet-Hawkes Topic Model (DHTM): Mô hình chủ đề Dirichlet-Hawkes
Document batch: Lô tài liệu
Dynamic Clustering Topic (DCT): Mô hình chủ đề gom cụm động
Dynamic Topic Model (DTM): Mô hình chủ đề động
Features of meaning: Đặc trưng ngữ nghĩa
Filtering: Lọc
Frequent sub-graph (FSG): Đồ thị con phổ biến
Graph of Words (GOW): Đồ thị từ
Microblogs: Bài viết ngắn dạng blog
Model's hyper-parameter sensitivity: Độ nhạy của siêu tham số mô hình (viết ngắn là độ nhạy)
MStream: Thuật toán gom cụm luồng dữ liệu dựa trên mô hình hỗn hợp DP
Noise: Yếu tố nhiễu
Outlier: Ngoại lệ
Politeness: Độ sâu
Preprocess: Tiền xử lý
Proximity measure: Đo lường lân cận
Sequential Monte Carlo (SMC): Tuần tự Monte Carlo
Sparse nature: Tính rời rạc tự nhiên
Sparsity of text: Sự rời rạc của văn bản
Stemming and Lemmatization: Trả từ về nguyên mẫu
Stop word: Từ dừng
Streaming LDA (ST-LDA): Streaming LDA
Survey: Khảo sát
Temporal Dynamic Process Model (TDPM): Mô hình hỗn hợp quy trình Dirichlet theo thời gian
Temporal model-LDA (TM-LDA): Mô hình LDA theo thời gian
Temporal Text Mining (TTM): Khai phá văn bản theo thời gian
Term Frequency (TF): Tần số từ
Term Frequency-Inverse Document Frequency (TF-IDF): Tần số từ - Tần số tài liệu nghịch đảo
Text corpus: Tập văn bản
Text similarity: Sự tương tự văn bản
Text to Graph (Text2graph): Đồ thị hóa văn bản
Trendy Keyword Extraction System (TKES): Hệ thống rút trích từ khóa tiêu biểu
Tokenization: Tách từ
Topic tracking model (TTM): Mô hình theo dõi chủ đề
Vector Space model (VSM): Mô hình không gian vectơ
Visualize: Hiển thị trực quan
Word relatedness: Sự liên quan từ
Word segmentation: Tách từ
Word similarity: Sự tương tự từ
Word vector: Véc tơ từ

LIST OF TABLES

Table 1.1: Analysis of the strengths and remaining weaknesses of the models
Table 3.1: Text representation with traditional BOW
Table 3.2: Text representation with BOW and TF-IDF
Table 3.3: Text representation with GOW
Table 3.4: Text representation combining BOW and GOW
Table 3.5: Topic vector representation in the GOW-Stream model
Table 3.6: Details of the experimental datasets
Table 3.7: Configuration details for the text stream clustering models
Table 3.8: Average output results of the text clustering task for the different models, measured by NMI
Table 3.9: Experimental output results of the text clustering task for the different models, measured by F1
Table 4.1: Properties of nodes and relationships
Table 4.2: An example of computing the word ranking score
Table 4.3: An example of computing the total weight of keywords in a category
Table 4.4: Example of the Burst storage structure
Table 4.5: Bursts of the keyword “Facebook”
Table 4.6: Determining the list of trending words shared with the keyword “Facebook”
Table 4.7: Execution time test for data collection
Table 4.8: Execution time test for adding data to the graph database
Table 4.9: Runtime test for processing
Table 4.10: Processing time for different numbers of articles of different lengths
Table 4.11: Similarity ratio of the data produced by TF-IDF implementations written in different programming languages
Table 4.12: Keyword frequencies
Table 4.13: Some parameters used with word2Vec
Table 4.14: Words related to the keyword “Ứng dụng” (“Application”)
Table 4.15: Comparison of similarity levels using different distance and similarity measures
Table 4.16: Model training time
Table 4.17: Processing time to find 10 related words
Table 4.18: Processing time test for detecting Bursts in articles over 19 days

The article dataset consists of the collected articles, organized in folders following the structure Date/Category/Article and stored as text files. Each file name is the article's title, and each file contains the title, description, and content (Figure 4.14).

The article dataset after preprocessing has the same structure as the article dataset. The difference is that the article content has been preprocessed by word segmentation and stop-word removal (Figure 4.15).

Figure 4.15: Storage structure of the preprocessed data
Figure 4.16: Storage structure of the list of top keywords per article

The dataset of top keywords per article is stored as a text file with the following fields: date (Date), article ID (PaperID), keyword (KeyWord), and frequency (Weight) (Figure 4.16). The dataset of top keywords per category has a similar structure.

Figure 4.17: Storage structure of the list of top keywords per category

This dataset is stored as a text file with the following fields: date (Date), category ID (Category), keyword (KeyWord), and ranking score (Rank) (Figure 4.17). The above are some of the formats and structures of some of the datasets. The system is fully flexible and can structure the data to match the experimental data requirements of a study.

4.3 Chapter summary

This chapter presented the method, experimental results, and discussion of the research addressing the second problem: finding trending phrases in text data streams. The research proposes the TKES system and applies the proposed AdaptingBurst algorithm to find trending phrases, building on the idea of Kleinberg's earlier algorithm. The proposed algorithms address burst detection, the computation and ranking of words, and the identification of representative bursts. The research also supports exporting datasets for use in further studies. In addition, future development of the system aims at parallel processing and computation to increase speed. The candidate also plans to use evaluation measures to assess the performance of the proposed model, and to apply the results of this research to text stream clustering, for example to improve the text feature representation used in clustering. Moreover, the data preprocessing, keyword extraction, and similar-keyword extraction steps that support trending phrase detection were presented in detail and tested by measuring processing time and comparing processing time and the accuracy of the results.

CHAPTER 5: CONCLUSION & FUTURE WORK

This chapter summarizes the results achieved, focusing on the problems solved and the issues that the dissertation set out to address. It reviews the main content: the study of related scientific work, and the methods for formulating and solving the 02 problems of the dissertation. For each proposed technique and solution, the problem description, method, experiments, strengths, novelty and continuous improvement, limitations, and future directions are made clear. The chapter has 02 parts: Section 5.1 assesses the results achieved, limitations, and future work; Section 5.2 assesses the academic and practical significance of the dissertation.

5.1 Results achieved, limitations, and future work

The literature review part of the dissertation: surveyed the history of work related to the research direction and to the problems posed, giving an overall view of the research issues; studied the foundational techniques behind the research issues; analyzed the strengths and weaknesses of related studies, from which the problems and the approaches to solving them were determined; compared existing solutions that use the same approach, identifying the advantages and limitations of each; continuously tracked related studies up to the present, showing the ongoing development of the research direction; and followed and reported the activities of leading research groups and well-known experts in the related research community.

The first problem is arguably the main problem of the dissertation and represents its important contributions, specifically:
• Proposes an approach to text stream clustering based on a mixture model, applying the evaluation of graphs of words (GOW) that appear in the given text corpus.
• Evaluates the relationships between words when inferring clusters.
• Proposes an approach that applies n-grams to text-to-graph conversion (text2graph), together with frequent subgraph mining (FSM), to extract frequent subgraphs from the given text corpus.
• Uses frequent subgraph extraction from text documents to support the estimation of the topic distribution of documents.
• Handles short-text stream clustering effectively by combining the evaluation of independent words (individual words in a document) and dependent words (words that appear in frequent subgraphs).
• Combines frequent-subgraph-based evaluation with independent word evaluation in the topic inference process of the Dirichlet process mixture model (DPMM) to improve the results of clustering texts from data streams.
• Addresses the challenge of natural topic change in text streams, improving clustering accuracy and processing time over earlier models based on independent word evaluation, when the performance of GOW-Stream is compared with recent state-of-the-art algorithms such as DTM, Sumblr, and MStream.

The strength of GOW-Stream is that it performs better than recently published state-of-the-art algorithms such as DTM, Sumblr, and MStream. GOW-Stream has good clustering processing time; however, it spends additional time on converting text to graphs and finding frequent subgraphs. Proposed future directions are: optimizing the model by representing text with more complex graph forms, applying feature representations that also take temporal semantics into account, and using the results of the trending phrase detection research to improve text feature representation; considering other methods for representing the relationships between words in text; and extending the implementation of GOW-Stream to distributed processing environments designed primarily for large-scale, high-speed text data streams, such as Apache Spark Streaming. In addition, the proposed model could be used to improve the performance of other text mining applications, such as word sense disambiguation [84], review mining [101], and time-series tasks [34]. Furthermore, many recent studies have effectively applied deep learning to improve clustering results [6, 19, 38, 40, 71, 86, 90, 96], [21, 24, 30, 31, 37, 41, 61, 62, 72, 77, 79, 80, 88, 89, 93, 94]. This is also seen as a possible future direction for the dissertation.

The second problem proposes the TKES system, whose main contribution is a keyword burst detection algorithm based on Kleinberg's algorithm, an algorithm that has proven effective and has been trusted and applied in many fields. Specifically, the research proposes algorithms for detecting bursts, trending phrases, and representative bursts. To build TKES, the dissertation uses TF-IDF to find keywords; a neural network, with the Skip-gram model, to train a model that finds sets of similar keywords; the Cosine, Euclidean, Manhattan, Minkowski, and Jaccard measures to compare similarity; and preprocessing techniques for Vietnamese text data. The experimental results of the research include: measuring processing time and comparing the processing time of the solutions on different datasets; and collecting source datasets and exporting the results as datasets for related research. Proposed future directions are: studying and restructuring the datasets into a common standard format for publication; completing the system to meet user requirements on other platforms such as smartphones and the Web, to support practical deployment; and using the results of the trending phrase detection research to improve the effectiveness of the GOW-Stream model by also capturing the trends of words in documents arriving from the stream when clustering.

5.2 Academic and practical significance of the dissertation

Academically, the dissertation proposes models, of which GOW-Stream shows its superiority when compared with recent state-of-the-art algorithms. The TKES system contributes a proposed algorithm for detecting trending phrases, which has the potential to be applied to optimizing the proposed GOW-Stream model. The publications of the dissertation comprise 04 international conference papers (Springer/ACM) and 02 international journal papers (01 indexed in Scopus Q3 and 01 in SCIE Q3). Practically, the proposed models and algorithms can be applied in many fields, and the system built has high practical value, serving the information mining needs of a large number of users in the era of the Fourth Industrial Revolution (Industry 4.0).

LIST OF PUBLISHED PAPERS

Four published conference papers:
• [CT1] Hong, T. V. T., & Do, P. (2018, February). Developing a graph-based system for storing, exploiting and visualizing text stream. In Proceedings of the 2nd International Conference on Machine Learning and Soft Computing (pp. 82-86). (https://dl.acm.org/doi/abs/10.1145/3184066.3184084)
• [CT2] Hong, T.V.T. and Do, P., 2018, October. SAR: A Graph-Based System with Text Stream Burst Detection and Visualization. In International Conference on Intelligent Computing & Optimization (pp. 35-45). Springer, Cham. (https://link.springer.com/chapter/10.1007/978-3-030-00979-3_4)
• [CT3] Hong, T.V.T. and Do, P., 2019, October. A Novel System for Related Keyword Extraction over a Text Stream of Articles. In International Conference on Intelligent Computing & Optimization (pp. 409-419). Springer, Cham.
(https://link.springer.com/chapter/10.1007/978-3-030-33585-4_41)
• [CT4] Hong, T.V.T. and Do, P., 2019, October. Comparing Two Models of Document Similarity Search over a Text Stream of Articles from Online News Sites. In International Conference on Intelligent Computing & Optimization (pp. 379-388). Springer, Cham. (https://link.springer.com/chapter/10.1007/978-3-030-33585-4_38)

Two journal papers (Scopus/SCIE indexed) accepted for publication:
• [CT5] Hong, Tham Vo Thi, and Phuc Do. “TKES: A Novel System for Extracting Trendy Keywords from Online News Sites”. In: Journal of the Operations Research Society of China (ISSN: 2194-6698) (Scopus indexed) (https://www.springer.com/journal/40305) (Scopus Q3, http://link.springer.com/article/10.1007/s40305-020-00327-4)
• [CT6] Hong, Tham Vo Thi, and Phuc Do. “GOW-Stream: a novel approach of graph-of-words based mixture model for semantic-enhanced text stream clustering”. In: Intelligent Data Analysis (ISSN: 1571-4128) (https://www.iospress.nl/journal/intelligent-data-analysis) (SCIE Q3, accepted for publication, September 2020)

REFERENCES

1. Agarwal Neha, Sikka Geeta, and Awasthi Lalit Kumar, Evaluation of web service clustering using Dirichlet Multinomial Mixture model based approach for Dimensionality Reduction in service representation. Information Processing & Management, 2020. 57(4): p. 102238.
2. Aggarwal Charu C, A Survey of Stream Clustering Algorithms, in Data Clustering: Algorithms and Applications, C.K.R Charu C Aggarwal, Editor. 2013, CRC Press. p. 229-253.
3. Aggarwal Charu C, et al. A framework for clustering evolving data streams. in Proceedings 2003 VLDB conference. 2003. Elsevier.
4. Ahmed Amr and Xing Eric. Dynamic non-parametric mixture models and the recurrent chinese restaurant process: with applications to evolutionary clustering. in Proceedings of the 2008 SIAM International Conference on Data Mining. 2008. SIAM.
5. Aldous David J, Exchangeability and related topics, in École d'Été de Probabilités de Saint-Flour XIII - 1983. 1985, Springer. p. 1-198.
6. Aljalbout Elie, et al., Clustering with deep learning: Taxonomy and new methods. arXiv preprint arXiv:1801.07648, 2018.
7. Alrehamy Hassan and Walker Coral, Exploiting extensible background knowledge for clustering-based automatic keyphrase extraction. Soft Computing, 2018. 22(21): p. 7041-7057.
8. Alzaidy Rabah, Caragea Cornelia, and Giles C Lee. Bi-LSTM-CRF sequence labeling for keyphrase extraction from scholarly documents. in The world wide web conference. 2019.
9. Amoualian Hesam, et al. Streaming-lda: A copula-based approach to modeling topic dependencies in document streams. in Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 2016.
10. Antonellis Panagiotis, et al., Efficient Algorithms for Clustering Data and Text Streams, in Encyclopedia of Information Science and Technology, Third Edition. 2015, IGI Global. p. 1767-1776.
11. Bakkum Douglas J, et al., Parameters for burst detection. Frontiers in computational neuroscience, 2014. 7: p. 193.
12. Beliga Slobodan, Meštrović Ana, and Martinčić-Ipšić Sanda, Selectivity-based keyword extraction method. International Journal on Semantic Web and Information Systems (IJSWIS), 2016. 12(3): p. 1-26.
13. Bicalho Paulo, et al., A general framework to expand short text for topic modeling. Information Sciences, 2017. 393: p. 66-81.
14. Blei David M and Lafferty John D. Dynamic topic models. in Proceedings of the 23rd international conference on Machine learning. 2006.
15. Blei David M, Ng Andrew Y, and Jordan Michael I, Latent Dirichlet Allocation. Journal of machine Learning research, 2003. 3(Jan): p. 993-1022.
16. Cai Yanli and Sun Jian-Tao, Text Mining, in Encyclopedia of Database Systems, L Liu and M.T ÖZsu, Editors. 2009, Springer US: Boston, MA. p. 3061-3065.
17. Cami Bagher Rahimpour, Hassanpour Hamid, and Mashayekhi Hoda, User preferences modeling using dirichlet process mixture model for a content-based recommender system. Knowledge-Based Systems, 2019. 163: p. 644-655.
18. Cao Feng, et al. Density-based clustering over an evolving data stream with noise. in Proceedings of the 2006 SIAM international conference on data mining. 2006. SIAM.
19. Chen Gang, Deep learning with nonparametric clustering. arXiv preprint arXiv:1501.03084, 2015.
20. Chen Junyang, Gong Zhiguo, and Liu Weiwen, A Dirichlet process biterm-based mixture model for short text stream clustering. Applied Intelligence, 2020: p. 1-11.
21. Curiskis Stephan A, et al., An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit. Information Processing & Management, 2020. 57(2): p. 102034.
22. Darling William M. A theoretical and practical implementation tutorial on topic modeling and gibbs sampling. in Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies. 2011.
23. Du Nan, et al. Dirichlet-hawkes processes with applications to clustering continuous-time document streams. in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2015.
24. Duan Tiehang, et al. Sequential embedding induced text clustering, a nonparametric bayesian approach. in Pacific-Asia Conference on Knowledge Discovery and Data Mining. 2019. Springer.
25. Erkan Günes and Radev Dragomir R, Lexrank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 2004. 22: p. 457-479.
26. Ferguson Thomas S, A Bayesian analysis of some nonparametric problems. The annals of statistics, 1973: p. 209-230.
27. Finegan-Dollak Catherine, et al. Effects of creativity and cluster tightness on short text clustering performance. in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016.
28. Fisher David, et al., Evaluating ranking diversity and summarization in microblogs using hashtags. University of Massachusetts, Boston, MA, Technical Report, 2015.
29. Fung Gabriel Pui Cheong, et al. Parameter free bursty events detection in text streams. in Proceedings of the 31st international conference on Very large data bases. 2005. VLDB Endowment.
30. Guo Xifeng, et al. Improved deep embedded clustering with local structure preservation. in IJCAI. 2017.
31. Guo Xifeng, et al. Deep clustering with convolutional autoencoders. in International conference on neural information processing. 2017. Springer.
32. Heydari Atefeh, et al., Detection of review spam: A survey. Expert Systems with Applications, 2015. 42(7): p. 3634-3642.
33. Hosseinimotlagh Seyedmehdi and Papalexakis Evangelos E. Unsupervised content-based identification of fake news articles with tensor decomposition ensembles. in Proceedings of the Workshop on Misinformation and Misbehavior Mining on the Web (MIS2). 2018.
34. Hu Jun and Zheng Wendong. Transformation-gated LSTM: Efficient capture of short-term mutation dependencies for multivariate time series prediction tasks. in 2019 International Joint Conference on Neural Networks (IJCNN). 2019. IEEE.
35. Hu Xia and Liu Huan, Text analytics in social media. Mining text data, 2012: p. 385-414.
36. Hu Xuegang, Wang Haiyan, and Li Peipei, Online Biterm Topic Model based short text stream classification using short text expansion and concept drifting detection. Pattern Recognition Letters, 2018. 116: p. 187-194.
37. Jiang Zhuxi, et al., Variational deep embedding: An unsupervised and generative approach to clustering. arXiv preprint arXiv:1611.05148, 2016.
38. Jindal Vasu. A personalized Markov clustering and deep learning approach for Arabic text categorization. in Proceedings of the ACL 2016 Student Research Workshop. 2016.
39. Kalogeratos Argyris, Zagorisios Panagiotis, and Likas Aristidis. Improving text stream clustering using term burstiness and co-burstiness. in Proceedings of the 9th Hellenic Conference on Artificial Intelligence. 2016.
40. Kampffmeyer Michael, et al., Deep divergence-based approach to clustering. Neural Networks, 2019. 113: p. 91-101.
41. Kim Jaeyoung, et al., Patent document clustering with deep embeddings. Scientometrics, 2020: p. 1-15.
42. Kleinberg Jon, Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery, 2003. 7(4): p. 373-397.
43. Lahiri Shibamouli, Mihalcea Rada, and Lai P-H, Keyword extraction from emails. Natural Language Engineering, 2017. 23(2): p. 295-317.
44. Le Hong Phuong Nguyen Thi Minh, Huyen Azim Roussanaly, and Vinh Hô Tuong, A hybrid approach to word segmentation of Vietnamese texts. Language and Automata Theory and Applications, 2008: p. 240.
45. Li Chenliang, et al., Enhancing topic modeling for short texts with auxiliary word embeddings. ACM Transactions on Information Systems (TOIS), 2017. 36(2): p. 1-30.
46. Li Chenliang, et al. Topic modeling for short texts with auxiliary word embeddings. in Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. 2016.
47. Li Hua, Text Clustering, in Encyclopedia of Database Systems, L Liu and M.T ÖZsu, Editors. 2009, Springer US: Boston, MA. p. 3044-3046.
48. Li Shan-Qing, Du Sheng-Mei, and Xing Xiao-Zhao. A keyword extraction method for chinese scientific abstracts. in Proceedings of the 2017 International Conference on Wireless Communications, Networking and Applications. 2017.
49. Liang Shangsong and de Rijke Maarten, Burst-aware data fusion for microblog search. Information Processing & Management, 2015. 51(2): p. 89-113.
50. Liang Shangsong, Yilmaz Emine, and Kanoulas Evangelos. Dynamic clustering of streaming short documents. in Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 2016.
51. Lynn Htet Myet, et al., Swiftrank: an unsupervised statistical approach of keyword and salient sentence extraction for individual documents. Procedia computer science, 2017. 113: p. 472-477.
52. Mai Khai, et al. Enabling hierarchical Dirichlet processes to work better for short texts at large scale. in Pacific-Asia Conference on Knowledge Discovery and Data Mining. 2016. Springer.
53. Margara Alessandro and Rabl Tilmann, Definition of Data Streams, in Encyclopedia of Big Data Technologies, S Sakr and A.Y Zomaya, Editors. 2019, Springer International Publishing: Cham. p. 648-652.
54. Martínez-Fernández José Luis, et al. Automatic keyword extraction for news finder. in International Workshop on Adaptive Multimedia Retrieval. 2003. Springer.
55. Musselman Andrew, Apache Mahout, in Encyclopedia of Big Data Technologies, S Sakr and A.Y Zomaya, Editors. 2019, Springer International Publishing: Cham. p. 66-70.
56. Neal Radford M, Markov chain sampling methods for Dirichlet process mixture models. Journal of computational and graphical statistics, 2000. 9(2): p. 249-265.
57. Neill Daniel B and Moore Andrew W. Anomalous spatial cluster detection. in Proceedings of the KDD 2005 Workshop on Data Mining Methods for Anomaly Detection. 2005.
58. Neill Daniel B, et al. Detecting significant multidimensional spatial clusters. in Advances in Neural Information Processing Systems. 2005.
59. Nguyen Hai-Long, Woon Yew-Kwong, and Ng Wee-Keong, A survey on data stream clustering and classification. Knowledge and information systems, 2015. 45(3): p. 535-569.
60. Nguyen Tri and Do Phuc. Topic discovery using frequent subgraph mining approach. in International Conference on Computational Science and Technology. 2017. Springer.
61. Park Jinuk, et al., ADC: Advanced document clustering using contextualized representations. Expert Systems with Applications, 2019. 137: p. 157-166.
62. Peters Matthew E, et al., Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
63. Pham Phu, Do Phuc, and Ta Chien DC. GOW-LDA: Applying Term Cooccurrence Graph Representation in LDA Topic Models Improvement. in International Conference on Computational Science and Technology. 2017. Springer.
64. Pitman Jim, Combinatorial Stochastic Processes: Ecole d'Eté de Probabilités de Saint-Flour XXXII-2002. 2006: Springer.
65. Qiang Jipeng, et al. Topic modeling over short texts by incorporating word embeddings. in Pacific-Asia Conference on Knowledge Discovery and Data Mining. 2017. Springer.
66. Qiang Jipeng, et al., Short text clustering based on Pitman-Yor process mixture model. Applied Intelligence, 2018. 48(7): p. 1802-1812.
67. Quan Xiaojun, et al. Short and sparse text topic modeling via self-aggregation. in Twenty-fourth international joint conference on artificial intelligence. 2015.
68. Quan Xiaojun, et al., Latent discriminative models for social emotion detection with emotional dependency. ACM Transactions on Information Systems (TOIS), 2015. 34(1): p. 1-19.
69. Romsaiyud Walisa. Detecting emergency events and geo-location awareness from twitter streams. in The International Conference on E-Technologies and Business on the Web (EBW2013). 2013. The Society of Digital Information and Wireless Communication.
70. Saul Lawrence K, Weiss Yair, and Bottou Léon, Advances in neural information processing systems 17: Proceedings of the 2004 conference. Vol. 17. 2005: MIT press.
71. Shah Setu and Luo Xiao. Comparison of deep learning based concept representations for biomedical document clustering. in 2018 IEEE EMBS international conference on biomedical & health informatics (BHI). 2018. IEEE.
72. Shaham Uri, et al., Spectralnet: Spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587, 2018.
73. Shi Tian, et al. Short-text topic modeling via non-negative matrix factorization enriched with local word-context correlations. in Proceedings of the 2018 World Wide Web Conference. 2018.
74. Shou Lidan, et al. Sumblr: continuous summarization of evolving tweet streams. in Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. 2013.
75. Teh Yee Whye, Dirichlet Process. 2010: p. 280-287.
76. Teh Yee Whye, Dirichlet Process. 2010.
77. Tian Kai, Zhou Shuigeng, and Guan Jihong. Deepcluster: A general clustering framework based on deep learning. in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. 2017. Springer.
78. Vlachos Michail, et al. Identifying similarities, periodicities and bursts for online search queries. in Proceedings of the 2004 ACM SIGMOD international conference on Management of data. 2004. ACM.
79. Wan Haowen, et al., Research on Chinese Short Text Clustering Ensemble via Convolutional Neural Networks, in Artificial Intelligence in China. 2020, Springer. p. 622-628.
80. Wang Binyu, et al., Text clustering algorithm based on deep representation learning. The Journal of Engineering, 2018. 2018(16): p. 1407-1414.
81. Wang Mengzhi, et al. Data mining meets performance evaluation: Fast algorithms for modeling bursty traffic. in Proceedings 18th International Conference on Data Engineering. 2002. IEEE.
82. Wang Wu, et al. Learning latent topics from the word co-occurrence network. in National Conference of Theoretical Computer Science. 2017. Springer.
83. Wang Xuerui and McCallum Andrew. Topics over time: a non-Markov continuous-time model of topical trends. in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 2006.
84. Wang Yinglin, Wang Ming, and Fujita Hamido, Word sense disambiguation: A comprehensive knowledge exploitation framework. Knowledge-Based Systems, 2020. 190: p. 105030.
85. Wang Yu, Agichtein Eugene, and Benzi Michele. TM-LDA: efficient online modeling of latent topic transitions in social media. in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. 2012.
86. Wang Zhiguo, Mi Haitao, and Ittycheriah Abraham, Semi-supervised clustering for short text via deep representation learning. arXiv preprint arXiv:1602.06797, 2016.
87. Weng Jianshu and Lee Bu-Sung, Event detection in twitter. ICWSM, 2011. 11: p. 401-408.
88. Xie Junyuan, Girshick Ross, and Farhadi Ali. Unsupervised deep embedding for clustering analysis. in International conference on machine learning. 2016.
89. Xu Dongkuan, et al. Deep co-clustering. in Proceedings of the 2019 SIAM International Conference on Data Mining. 2019. SIAM.
90. Xu Jiaming, et al., Self-taught convolutional neural networks for short text clustering. Neural Networks, 2017. 88: p. 22-31.
91. Yamamoto Shuhei, et al., Twitter user tagging method based on burst time series. International Journal of Web Information Systems, 2016. 12(3): p. 292-311.
92. Yan Xifeng and Han Jiawei. gspan: Graph-based substructure pattern mining. in 2002 IEEE International Conference on Data Mining, 2002. Proceedings. 2002. IEEE.
93. Yang Bo, et al. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. in international conference on machine learning. 2017. PMLR.
94. Yang Min, et al., Cross-domain aspect/sentiment-aware abstractive review summarization by combining topic modeling and deep reinforcement learning. Neural Computing and Applications, 2020. 32(11): p. 6421-6433.
95. Yang Zaihan, et al. Parametric and non-parametric user-aware sentiment topic models. in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2015.
96. Yi Junkai, et al., A novel text clustering approach using deep-learning vocabulary network. Mathematical Problems in Engineering, 2017. 2017.
97. Yin Jianhua, et al. Model-based clustering of short text streams. in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.
98. Yin Jianhua and Wang Jianyong. A model-based approach for text clustering with outlier detection. in 2016 IEEE 32nd International Conference on Data Engineering (ICDE). 2016. IEEE.
99. Yin Jianhua and Wang Jianyong. A text clustering algorithm using an online clustering scheme for initialization. in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 2016.
100. Yoo Shinjae, Huang Hao, and Kasiviswanathan Shiva Prasad. Streaming spectral clustering. in 2016 IEEE 32nd international conference on data engineering (ICDE). 2016. IEEE.
101. Yuan Chunyuan, et al. Learning review representations from user and product level information for spam detection. in 2019 IEEE International Conference on Data Mining (ICDM). 2019. IEEE.
102. Zhang Xin, Fast algorithms for burst detection. 2006, New York University, Graduate School of Arts and Science.
103. Zhang Yun, Hua Weina, and Yuan Shunbo, Mapping the scientific research on open data: A bibliometric review. Learned Publishing, 2018. 31(2): p. 95-106.
104. Zhou Deyu, et al., Unsupervised event exploration from social text streams. Intelligent Data Analysis, 2017. 21(4): p. 849-866.
105. Zhu Longxia, et al., A joint model of extended LDA and IBTM over streaming Chinese short texts. Intelligent Data Analysis, 2019. 23(3): p. 681-699.
106. Zubaroğlu Alaettin and Atalay Volkan, Data stream clustering: a review. Artificial Intelligence Review, 2020.
107. Zuo Yuan, et al. Topic modeling of short texts: A pseudo-document view. in Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 2016.
108. Zuo Yuan, Zhao Jichang, and Xu Ke, Word network topic model: a simple but general solution for short and imbalanced texts. Knowledge and Information Systems, 2016. 48(2): p. 379-398.

... the problem of clustering short text streams; the problem of clustering text streams whose topics are not fixed; the problem of considering word co-occurrence relationships when clustering text streams; the problem of detecting trending phrases and capturing the trend semantics of words in documents arriving from the stream; ...
... proposes and solves the 02 problems of the dissertation. For the first problem, the main problem of the dissertation, the author proposes the GOW-Stream technique for semantic text stream clustering based on graphs of words. The second problem studies the detection of trending phrases in text streams. Chapter ...
... contributions of the dissertation: comparing several approaches related to text stream clustering, and approaches to event detection and burst detection in text streams. 2.1 Comparison of approaches related to text stream clustering. Recent studies on clustering ...
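
Illustrative sketch of the similarity measures named in Section 5.1

Section 5.1 names the Cosine, Euclidean, Manhattan, Minkowski, and Jaccard measures used to compare similar keywords. The following is a minimal sketch of the textbook definitions of these measures, not the TKES implementation. The function names, the use of plain Python lists for word vectors, and the conventions for zero vectors and empty sets are assumptions made only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def minkowski_distance(u, v, p):
    """Minkowski distance of order p (p >= 1) between equal-length vectors."""
    if p < 1:
        raise ValueError("p must be >= 1")
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def manhattan_distance(u, v):
    """Manhattan distance, the Minkowski distance with p = 1."""
    return minkowski_distance(u, v, 1)


def euclidean_distance(u, v):
    """Euclidean distance, the Minkowski distance with p = 2."""
    return minkowski_distance(u, v, 2)


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets are identical
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Hypothetical word vectors and keyword sets, for demonstration only.
    vec_a = [1.0, 2.0, 0.0]
    vec_b = [2.0, 4.0, 0.0]
    print("cosine:", cosine_similarity(vec_a, vec_b))
    print("euclidean:", euclidean_distance(vec_a, vec_b))
    print("manhattan:", manhattan_distance(vec_a, vec_b))
    print("minkowski (p=3):", minkowski_distance(vec_a, vec_b, 3))
    print("jaccard:", jaccard_similarity({"app", "phone"}, {"phone", "web"}))
```

Cosine and Jaccard are similarities (larger means more alike), while the Minkowski family are distances (smaller means more alike), so the two kinds of score need to be ranked in opposite directions when they are compared side by side.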
