Research and Development of a Method for Detecting Articles with Reactionary Content

VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY

Hoàng Tuấn Long

RESEARCH AND DEVELOPMENT OF A METHOD FOR DETECTING ARTICLES WITH REACTIONARY CONTENT

MASTER'S THESIS IN COMPUTER SCIENCE
Code: 60.48.01.01
SCIENTIFIC SUPERVISOR: Dr. Ngô Thanh Hùng
HO CHI MINH CITY, 2017

ACKNOWLEDGMENTS

During my studies and the preparation of this master's thesis, I received help from the faculty of the University of Information Technology, especially Dr. Ngô Thanh Hùng, and comments from scientists, administrators, friends, and colleagues, in addition to my own efforts. I have now completed the master's thesis titled “Research and development of a method for detecting articles with reactionary content” in the Computer Science program. The results obtained are a small scientific and practical contribution to the detection of reactionary articles. However, within the scope of a thesis, and with limited time and expertise, shortcomings are unavoidable. I look forward to receiving guidance and comments from the faculty.

I express my deep gratitude to Dr. Ngô Thanh Hùng for his dedicated guidance and for providing the scientific knowledge needed during the work on this thesis. I sincerely thank the faculty of the Faculty of Computer Science and the Office of Graduate Studies of the University of Information Technology for creating favorable conditions for me to complete this thesis. I also sincerely thank the staff of the People's Police University (Đại học CSND) for their support and for providing relevant materials that helped me complete the thesis.

Ho Chi Minh City, 1 August 2017
Student: Hoàng Tuấn Long

DECLARATION

I declare that this is my own research work. The data and results presented in the thesis are truthful and have not been published in any other work.

Student: Hoàng Tuấn Long

TABLE OF CONTENTS

Title page
Acknowledgments
Declaration
Table of contents
List of symbols and abbreviations
List of tables
List of figures and charts
Chapter 1. INTRODUCTION
Chapter 2. PRACTICAL AND THEORETICAL FOUNDATIONS
2.1 Propaganda activity through articles containing reactionary content
2.1.1 Propaganda activity through articles containing reactionary content
2.1.2 Selected viewpoints and policies of the Party and State on countering this activity
2.1.3 Difficulties and challenges in this work
2.2 Methods for determining whether an article contains reactionary content
2.2.1 Expert method
2.2.2 Detection through characteristic phrases
2.2.3 Text classification using grammatical analysis
2.2.4 Text classification using statistical machine learning
2.2.5 Introduction to Apache Spark, GraphX, and Scrapy
2.3 Conclusion
Chapter 3. ANALYSIS AND DESIGN OF THE ALGORITHM AND SYSTEM
3.1 Idea for the data structure and algorithm
3.2 Description of the algorithm
3.3 Algorithm for identifying articles containing reactionary content
3.4 System for extending the triplet set based on VietWordNet
3.5 Integrated system for extracting and analyzing articles
Chapter 4. EXPERIMENTS AND EVALUATION
4.1 Implementation environment for the algorithm and system
4.2 Test environment
4.3 Test data
4.4 Test results
4.5 Evaluation of results
4.6 Conclusion
Chapter 5. CONCLUSIONS AND RECOMMENDATIONS
5.1 Conclusions
5.2 Recommendations
LIST OF THE AUTHOR'S SCIENTIFIC PUBLICATIONS
REFERENCES
APPENDIX

LIST OF SYMBOLS AND ABBREVIATIONS

CSVN: Cộng sản Việt Nam (Communist Party of Vietnam)
LAS: Labeled Attachment Score
LDA: Latent Dirichlet Allocation
RDD: Resilient Distributed Dataset
TBCN: Tư bản chủ nghĩa (capitalist)
UAS: Unlabeled Attachment Score
VietWordNet: Vietnamese word network
XHCN: Xã hội chủ nghĩa (socialist)

LIST OF TABLES

4.1 Distributed running time with the manually built triplet set
4.2 Distributed running time with the extended triplet set
4.3 Experimental results with the manually built triplet set
4.4 Experimental results with the extended triplet set

LIST OF FIGURES AND CHARTS

2.1 Structure of a triplet, with its elements and their attributes
2.2 Distributed operation model of Apache Spark
2.3 Data flow in Scrapy
3.1 Distributed execution
3.2 The system while waiting to run
3.3 The system while extracting articles
3.4 The system while analyzing articles
3.5 The Stop Crawl and Process option
4.1 Distributed run with workers
4.2 Distributed run with workers

Chapter 1. INTRODUCTION

* Reasons for choosing the topic:

Today, the Internet has become a medium through which communication, information exchange, cooperation, and interaction among individuals, organizations, and nations across the planet take place quickly and conveniently, contributing to the development of freedom of speech worldwide. With information and communication technology, the information that individuals send out to society in exercising their freedom of speech travels so fast that news can reach the other side of the globe within minutes. Everyone has an equal right to express opinions on forums and to comment on matters of law and state management. Everyone has the opportunity to exchange, discuss, share joys and sorrows, express opinions, and learn from experience when taking part in forums. In this way the Internet brings people around the world closer together and acts as a lever for the strength of the community, including that of young people, contributing to building and developing a knowledge economy. The benefits of the Internet described above are an important driving force for national economic development.

However, this also harbors threats to national security and to social order and safety, typified by hostile and reactionary forces using the Internet to spread propaganda and publish information that mixes truth and falsehood in order to undermine ideology, sow
internal division, and incite demonstrations, disorder, and riots, with the aim of abolishing the regime and overthrowing the leadership of the Party, Marxism-Leninism, and Ho Chi Minh Thought. Regarding propaganda methods, these actors continue to use websites and blogs hosted on servers abroad while stepping up the creation of social network accounts to spread harmful content. From 21 November 2015 to 1 November 2016, 400 websites and blogs were detected (an increase of 125 over the same period in 2015), along with 554 Facebook pages that regularly post harmful content. Counting only the websites and blogs with harmful content, 75,000 items were posted, concentrated around the 12th Party Congress, the election of the 14th National Assembly, and the marine environmental incident in several central coastal provinces.

In this project we study and propose a solution for collecting articles and analyzing their content in order to identify articles with reactionary content on the Internet.

Studies on extracting data from the web (social networks)

The first topic is the web data collection system (web crawler), defined as a software program that automatically traverses and downloads the content of web pages for a given purpose. As Internet technology develops, social network applications keep upgrading their technology to give users a better experience, but this also creates difficulties for web crawlers. According to [12], the large web crawlers operating worldwide include:
- GoogleBot
- BingBot
- Facebook External Hit
- Baiduspider

In Vietnam, several web crawlers have been developed and operate at small scale for fixed purposes, such as:
- Price comparison: aha.vn
- Classified-ad aggregation and search: raovat.vn
- News aggregation: baomoi.com, timnhanh.com
- Music aggregation and search: baamboo.com, mp3.zing.vn, 7sac.com
- General aggregation and search: vnnsearch.com, vnsearch.net

In general, web crawlers mainly collect data published on web pages that are reachable through direct URLs. They have difficulty collecting data from social networks, where the data displayed depends on user interaction. The extraction methods that web crawlers commonly use are social network APIs [8] or content-extraction
techniques applied to the collected HTML source, based on predefined web page structures [6].

Studies on classifying article content

The second topic is the analysis of article content, that is, the classification of Vietnamese text on large datasets. Analyzing article content is a traditional text processing and classification problem, but applying it to social network data runs into the volume of data to be processed, which can reach terabytes or even zettabytes. Storing and processing that much data calls for distributed computing technology (cluster computing), the most common being Hadoop MapReduce. Many current methods for analyzing article content use machine learning techniques, for example:
- SVM with feature selection by singular value decomposition (SVD) [1]
- SVM for classifying texts whose keywords are weighted using the string kernel method [2]
- A Vietnamese document classification system based on Naïve Bayes [3]

However, few publications address the classification of Vietnamese text on big data.

Studies on sentiment/opinion analysis of articles

The authors of [11] classify readers and political articles in the United States into two classes, conservative and liberal, on the assumption that people with conservative political leanings endorse articles with a conservative leaning and, likewise, people with liberal leanings endorse articles with a liberal leaning. Under that assumption, starting from an initially labeled set of articles and users, the authors use semi-supervised learning algorithms to label all the remaining articles and users. The study implements this idea with the semi-supervised algorithms Random Walk with Restart, Local Consistency Global Consistency, and Absorbing Random Walk. It also implements an article classifier using SVM with feature words analyzed by Apache Lucene. The semi-supervised algorithms implemented under this assumption gave more accurate results than SVM text classification.

Study [9] classifies articles expressing political viewpoints into three classes (conservative, liberal, and vague or neutral) based on sentiment analysis of user comments. The sentiment of user comments is summarized as a probability distribution, expressing the probability that a user's comment is positive, negative, or neutral when the original article
carries a conservative, liberal, or vague political viewpoint. Once the probability distributions have been determined, an article is assigned to a political viewpoint by a function that aggregates the sentiment of its comments. The aggregation function is built on either the majority vote principle or the maximum a posteriori principle.

The authors of [5] experiment with classifying Twitter users into two classes, conservative and liberal, using several different methods. In the first experiment, the tweet text, with stopwords, hashtags, mentions, and URLs removed, is fed to an SVM for training. The features are single terms (unigrams), and the feature vector holds the TF.IDF value of each feature. In the second experiment, only hashtags are included in the feature vector, weighted by the TF.IDF of the hashtags. In the third experiment, the authors represent the interactions (retweets and mentions) among users as a social network and use the Adjusted Rand Index in classifying users. The experiments show that classification using interaction information among Twitter users is more accurate than the algorithms that analyze tweet content.

The authors of [7] propose the Cross-Perspective Topic Model to divide articles on the same topic into groups with different viewpoints. The proposal makes it possible to quantify differences in viewpoint among articles on a topic, which helps detect and reduce differences or conflicts in political, religious, and other viewpoints.

Study [10] collects tweets about the 2013 Pakistan election and analyzes the viewpoints of the tweets with several existing algorithms such as NB, KNN, and Prind (in the Rainbow software). The results show that accuracy is not high (at most about 70%). Nevertheless, the analysis shows that user opinions inferred from tweets are consistent with the results of the 2013 Pakistan election. Thus, using Twitter effectively for campaigning can help parties improve their chances of winning.

Study [4] proposes PTSVM (Progressive Transductive Support Vector Machine), an improvement on the TSVM algorithm. The algorithm turns the supervised learner SVM into a semi-supervised one (as TSVM does). Its improvement over TSVM is that, in each training round, labels whose confidence falls within an allowed range are treated as labels for the next round. Experiments show that the algorithm increases accuracy by 2%.

Scientific merit and novelty of the project

This
project applies several of the methods studied to collecting data on the Internet and to analyzing the collected content on a big-data processing platform, in order to determine whether an article has reactionary content. It thereby contributes to identifying reactionary individuals who use social networks to spread ideas opposing the State among the public, especially among adolescents and young people. The proposed methods are studied, selected, and adapted to suit the application domain. The project is carried out as part of the step-by-step fight against reactionary forces that use the Internet for propaganda against the Party and State. The information obtained helps administrators in the struggle on the ideological front, especially as information technology develops, social networks proliferate with large volumes of information that are hard to control, and hostile forces use increasingly sophisticated and complex schemes and tactics.

Objectives, subjects, and scope

6.1 Objective: The objective of the thesis is to research and build a method for detecting articles with reactionary content using classification techniques. To achieve it, the project carries out the following tasks: building a technique for collecting information, and building a text classification method to determine whether an article has reactionary content.

6.2 Subjects: Articles on the Internet; techniques for classifying articles and for sentiment analysis.

6.3 Scope: Articles extracted from websites and blogs. The articles used for training and testing are written in Vietnamese (or have words outside the Vietnamese vocabulary edited to fit).

Content and methods

Task 1: Study systems for collecting data on the Internet
- Objective: Build a system for automatic collection on the Internet
- Method: Study papers, theses, dissertations, monographs, and textbooks; re-implement some techniques for verification. The plan is to use social network APIs and content extraction techniques based on predefined web page structures [6]
- Result: a system that extracts and stores data (articles)

Task 2: Study articles with reactionary content on the Internet
- Objective: Understand the types of reactionary information on social networks, their characteristics, and the professional measures for detecting signs of reactionary content,
as well as the relevant regulations of the Ministry of Public Security of Vietnam
- Method: Survey professional documents, legal documents, reports, and news articles
- Result: guidance for selecting and building the text classification method

Task 3: Study and apply content analysis techniques, based on text analysis and big-data processing, to detect articles with reactionary content
- Objective: Based on the text classification methods and the characteristics of reactionary texts studied, the student selects, implements, and improves the accuracy of techniques for classifying Vietnamese text on big data, in order to determine whether an article is reactionary
- Method: Study the literature on text classification, sentiment/opinion analysis, parallel and/or distributed processing, and implementation methods. The plan is to build a set of characteristic words and to apply SVM (or an improved variant of SVM) on that vocabulary
- Result: a method and software for classifying whether a Vietnamese text is reactionary

Task 4: Test the proposed method
- Objective: Run experiments to verify the effectiveness of the integrated techniques
- Method: Label the training and test data; train and test; compile, evaluate, and explain the results

Task 5: Write a scientific paper for publication
- Objective: Write a scientific paper on the results of the thesis
- Planned content: a method for classifying Vietnamese text on big data to determine whether a text is reactionary
- Planned venue: domestic conferences or journals

Research schedule

- Implementation period: the 24 weeks set out in the detailed plan
- Work plan with the supervisor: meet for discussion every week; exchange regularly by phone and e-mail
- Attend class activities and related seminars
- Detailed plan:

Weeks 1-4:
- Study the types of reactionary information, legal regulations, professional regulations, ...
- Study systems for collecting information on the Internet

Weeks 5-10:
- Study and build the test data
- Seminar on research results at the Information Systems Laboratory (PTN HTTT)
- Build the algorithm for detecting articles with reactionary content

Weeks 11-20:
- Test and improve the algorithm
- Seminar on research results at the Information Systems Laboratory (PTN HTTT)
- Write a paper for submission to a conference or journal

Weeks 21-24:
- Evaluate and test the algorithm; compare it with other algorithms
- Write the thesis report

References

Vietnamese

[1] Trần Cao Đệ and Phạm Nguyên Khang, "Phân loại văn bản với máy học vector hỗ trợ và cây quyết định", Tạp chí Khoa học, 21a, pp. 52-63, 2012.
[2] T. T. Huỳnh and S. T. Trần, "Hệ thống nhận dạng phân loại văn bản", Đại học Công nghệ Thông tin, Hồ Chí Minh, 2007.
[3] T. T. T. Trần, C. T. Vũ, and N. Tạ, "Xây dựng hệ thống phân loại tài liệu Tiếng Việt", Khoa Công nghệ Thông tin, Trường ĐH Lạc Hồng, Biên Hòa, 11/2012.

English

[4] Bao, Y., & Quan, C. (2014, December). A Novel PTSVM Algorithm for Twitter Sentiment Analysis. International Journal of Advanced Intelligence, Volume 6, Number 1, pp. 1-11. AIA International Advanced Information Institute.
[5] Conover, M. D., Gonçalves, B., Ratkiewicz, J., Flammini, A., & Menczer, F. (2011, October). Predicting the political alignment of Twitter users. In 2011 IEEE Third International Conference on Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom) (pp. 192-199). IEEE.
[6] Do Van, Nhon, Vu Lam Han, and Trung Le Bao. "News Aggregating System Supporting Semantic Processing Based on Ontology."
In Knowledge and Systems Engineering, pp. 285-297. Springer International Publishing, 2014.
[7] Fang, Y., Si, L., Somasundaram, N., & Yu, Z. (2012, February). Mining contrastive opinions on political texts using cross-perspective topic model. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (pp. 63-72). ACM.
[8] Friesel, Rob. PhantomJS Cookbook. Packt Publishing Ltd, 2014.
[9] Park, S., Ko, M., Kim, J., Liu, Y., & Song, J. (2011, March). The politics of comments: predicting political orientation of news stories with commenters' sentiment patterns. In Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work (pp. 113-122). ACM.
[10] Razzaq, M. A., Qamar, A. M., & Bilal, H. S. M. (2014, August). Prediction and analysis of Pakistan election 2013 based on sentiment analysis. In 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) (pp. 700-703). IEEE.
[11] Zhou, D. X., Resnick, P., & Mei, Q. (2011, July). Classifying the Political Leaning of News Articles and Users from User Votes. In ICWSM.

Website

[12] Top 10 crawler in the world, https://www.incapsula.com/blog/know-your-top10-bots.html
[13] Facebook statistics worldwide, http://www.allin1social.com/facebook-statistics/countries/

Ho Chi Minh City, 17 July 2016
Student (full name and signature): Hoàng Tuấn Long
Supervisor (full name and signature): Dr. Ngô Thanh Hùng

A Method for Finding Documents Containing Reactionary Viewpoints

Hoang Tuan Long (1), Tran Phuc Anh (2), Ngo Thanh Hung (3)
(1) People's Police University, Ho Chi Minh City, Vietnam
(2) University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam
(3) Ph.D., Faculty of Information Systems, University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam

Abstract

This study presents a method to identify articles written in Vietnamese on the Internet that contain reactionary viewpoints against the Government of Vietnam and the leadership of the
Communist Party of Vietnam. These articles often contain various errors such as spelling mistakes, typos, misplaced punctuation marks, and new “terms” that are unfamiliar to Vietnamese readers. Hence, grammatical and vocabulary analysis methods are not appropriate. We propose to use word orders in triplet form (Subject, Verb, Object) and its variants, including the doublet forms (Subject, Predicate, null) or (Verb, Object, null) and the singlet form (Subject, null, null), to screen these articles according to the following principle: if an article has at least one sentence containing the elements of such a word order, the article is considered to contain reactionary viewpoints. The original triplets are established from the training corpus (dataset) and then extended using the synonyms in VietWordNet. Extending the triplets increases the accuracy of the algorithm significantly. The program can help professional security units reduce staffing needs and improve operational effectiveness.

Keywords: document analysis, document classification, reactionary viewpoint, triple, edge triplet, triplet finding, Spark GraphX, VietWordNet

1 Introduction

The war that reunited the North and the South of Vietnam ended more than 40 years ago, but some members of the defeated side, now living in many different countries, are still trying to incite Vietnamese people inside and outside the country to take part in sabotage activities against the Government of Vietnam. One of the most effective ways to disseminate their reactionary viewpoints is through social media and web pages or weblogs. Therefore, one of the tasks in ensuring political security and social order for the country is to identify the articles containing reactionary viewpoints, expose the wrongful allegations, and provide the public with clear explanations. This study proposes a method to detect articles containing reactionary viewpoints in order to assist with this task. This problem
belongs to the field of document classification. A number of related studies are presented below.

1.1 Professional method

A unit of the public security sector is assigned to detect articles containing reactionary viewpoints. The responsible officers visit the social media pages, web pages, and weblogs where reactionary viewpoints are often expressed, and analyze each article to determine whether it is reactionary [1] based on one of the following factors:

- The article contains phrases that incite against, defame, or distort the government, the Communist Party of Vietnam, prominent figures of the Party, or historical details, with the purpose of spreading suspicion, discontent, and mistrust toward them; or it contains phrases that incite people to take part in illegal gatherings, causing internal riots or large-scale overthrow with the support of external reactionary organizations.

Example 1: Consider the following sentence: “Nhưng đây, lại có vấn đề mà muốn làm rõ, với cá nhân ơng "trí ngủ" Đỗ Văn Xê mà với ngàn dư luận viên ngày đêm giúp đảng CSVN che giấu thật tồi bại đảng.”

An article with one sentence containing phrases of this type is considered to contain reactionary elements. In Example 1, the phrase is: “sự thật tồi bại đảng”. The article is then analysed thoroughly, focusing on the phrases or sentences containing the reactionary elements, in order to identify every reactionary allegation and to prepare appropriate explanations or communications for the public;

- The article does not contain such noticeable phrases, but distorts the truth or incites riots or overthrow in humorous or figurative ways.

Example 2: Consider the following sentence: “Cuộc CCRĐ Hồi Thứ Nhất với mục tiêu lừa đảo Người Cày Có Ruộng kéo dài từ năm 1953 đến năm 1956 diệt chủng long
trời lở đất đến phải sửa sai chấm dứt Trong “bác Hồ” đóng phim nhỏ vài giọt lệ khóc người chết oan Võ Đại tướng phải thay mặt cụ Tổng bí thư Trường Chinh đứng nhận sửa sai.”

This type of article, though it normally contains no such keyphrases, still contains reactionary elements. The second example has no keyphrase but still implies defamation of the leaders, causing public mistrust. Articles of this form are analyzed carefully in every aspect and implication to identify reactionary allegations and to prepare suitable explanations and communications.

In reality, most reactionary articles fall into the first type, i.e. they contain reactionary phrases. This type accounts for more than 95% of all reactionary articles. Because the responsible unit still works manually, without the assistance of computer programs, the work is costly and labor-intensive, whereas detecting reactionary phrases by program is feasible and not very complicated. Developing programs to detect articles containing implied distortions or incitement to riot is more challenging, and possibly infeasible given the current state of research. The following overview therefore covers only the studies aimed at detecting articles that contain reactionary phrases, not those that contain implied reactionary elements.

1.2 Keyphrase finding method

The professional analysis above suggests one of the simplest methods: analyse each sentence to detect the appearance of keyphrases, hereinafter referred to as the first method. If a sentence contains a reactionary element, the article is determined to contain reactionary elements. Otherwise, if none of the sentences in the article contains any reactionary element, the article is determined not to contain reactionary elements. Some phrases, however, can be split into grammatical elements such as S-V-O; for instance, “chính quyền đàn áp nhân dân” can be
decomposed into “chính quyền” as S, “đàn áp” as V, and “nhân dân” as O. It may then not be feasible to detect the entire phrase in a sentence if the writer has added another subject, adverbial phrase, predicate, or object in the middle. A sentence with added words that still follows the S-V-O structure will not be caught by simple finding algorithms. Thus, the accuracy of the algorithms is significantly reduced.

Example 3: Consider the following sentence: “Hiện nay, quyền sức đàn áp nhân dân tham gia biểu tình chống đối lại định họ.” In this example, the S-V-O structure is used: “chính quyền, đàn áp, nhân dân”. However, the additional words in the middle make it impossible to detect the reactionary element as an entire phrase.

1.3 Document classification using semantic analysis method

The next method for solving the problem is to use grammatical analysis algorithms to detect reactionary phrases, hereinafter referred to as the second method. There are not many studies on grammatical analysis of the Vietnamese language. Most focus on the segmentation of words and phrases [2]; some study the grammatical functions of words and phrases in sentences [3-7]. The second method gives more accurate results than the first because it can determine exactly whether the elements of a phrase appearing in a sentence are semantically connected.

Example 4: Consider the following sentence: “Dưới lãnh đạo Đảng Cộng sản chúng thấy dã man, tàn độc lực thù địch.” Here the phrase “dã man, tàn độc” does not modify the phrase “Đảng Cộng sản”. In fact, the percentage of sentences that contain all elements of a reactionary phrase without any grammatical connection among them is very low.

The second method also has shortcomings in practice: if a sentence contains spelling mistakes, new, rare, or borrowed terms and words, or grammatical errors due to wrong or missing punctuation marks and conjunctions, the algorithm does not work or can
produce a wrong result, and these cases are very common in reality.

Example 5: Consider the following sentence: “Nhưng kế hỗn binh kẻ câm quyền, cộng sản, hệ thống đảng, chuyên lừa, lọc dối trá.” (written correctly, it would mean: “But that is only the delaying tactic of the power holders, the communists, the party system, who are deceptive and cheating”). The algorithm gives an incorrect result because of the wrong punctuation mark in “lừa, lọc” and the misspelled word “câm quyền”, so the accuracy of the algorithm is lowered.

1.4 Document classification using statistical machine learning methods

Statistical machine learning methods such as Bayes and LDA [8-12], hereinafter referred to as the third method, classify texts by first identifying the words that represent each class and then applying the relevant functions to the text to determine the class that each keyphrase belongs to. This method has an advantage over the first and second methods because it automates the construction of key words and phrases while still avoiding the shortcomings of the first method. However, because it depends on the segmentation of words and phrases, the third method shares the shortcomings of the second method with respect to spelling mistakes, grammatical errors, and borrowed or rare terms. These shortcomings significantly reduce the efficiency of this method.

2 Proposed method using the triplet

2.1 Proposed data structure and algorithm

According to the analyses above, the first method, though very simple, is effective for this problem. Hence, we propose to use this method with a minor change: each keyphrase is analysed as a set of elements with semantic relations. Each set has at most three elements and is called a triplet. The semantic relations can be Subject-Verb-Object, Noun-Modifying Adjective, Verb-Modified Noun, Noun, etc.

Example 6: The one-element
triplet (“Hồ động chủ”, “”, “”). The two-element triplet (“nhà nước”, “độc tài”, “”) (meaning “government”, “dictatorship”, “”), when combined as a phrase, expresses the “dictatorship” policy of the “government” that the reactionists want to broadcast. The three-element triplet (“chính quyền”, “đàn áp”, “nhân dân”) (meaning “authority”, “suppress”, “people”), when its elements are combined, expresses the cruel policy of the “authority” toward its “people”, which incites people and induces them to take part in wrongful activities.

The proposed algorithm searches for the appearance of triplet elements in all sentences. If a sentence contains all the elements of a triplet in their triplet order, the text is marked as containing reactionary viewpoints; otherwise it is not. As a result, the proposed algorithm limits the shortcomings of the first method, because it still identifies sentences containing the original keyphrases (when the triplet elements are not split) as well as those with additional words or phrases in the middle. However, the algorithm shares one disadvantage of the first method: the elements of a triplet may appear in a sentence in the right order without any semantic relation. In fact, among the sentences that contain keyphrases with additional words, the percentage whose elements are grammatically related is considerably higher than the percentage whose elements are unrelated.

Because language is rich and varied, a word or phrase often has one or more synonyms. To cover the various ways of expressing a reactionary viewpoint (a triplet), we propose to extend the triplets using the synonyms described in VietWordNet [13].

Example 7: for the manually built triplet (“chính quyền”, “đàn áp”, “nhân dân”), if the synonyms of the word “chính quyền” are taken from VietWordNet, we can
come up with the following additional triplets: (“chính phủ”, “đàn áp”, “nhân dân”), (“nhà nước”, “đàn áp”, “nhân dân”), etc.

2.2 Realization of the triplet finding algorithm based on the Spark GraphX platform

The algorithm above has been realized using the graph data structure in Spark GraphX [14, 15]. The first and third elements of the triplet (which can be empty) are represented as vertices of the graph, whose main attribute is the character string of the element. The middle element of the triplet (which can be empty) is represented as an edge of the graph, whose main attribute is the character string of that element.

Example 8: The triplet (“Hồ tệ”, “”, “”) is represented by the following vertices and edge: Vertex(0L, “”), Vertex(1L, “Hồ tệ”), Edge(1L, 0L, “”). The triplet (“chính quyền”, “đàn áp”, “nhân dân”) is represented by the following vertices and edge: Vertex(2L, “chính quyền”), Vertex(3L, “nhân dân”), Edge(2L, 3L, “đàn áp”).

The algorithm to detect the appearance of triplets runs in a distributed manner on the Spark platform as follows:
- The set of triplets is generated from all edge triplets of the graph and converted into an RDD to be distributed to the nodes of the Spark cluster;
- The set of sentences is extracted from the article, converted into a broadcast variable, and sent to all nodes of the cluster;
- At each node, every sentence of the article is compared with the triplets received by that node. If a sentence contains all the elements of a triplet in their triplet order, it is counted as a match and recorded in the results for the text;
- Once processing ends at all nodes, the matching data are reviewed. If at least one match is detected, the text is marked as containing reactionary elements; otherwise it is not.

2.3 Realization of the algorithm to extend the triplets based on VietWordNet

VietWordNet [13] is based on Princeton WordNet version 3.0. The
compilation of its data was inherited and developed from the WNMS tools, a web-based system developed under the AsianWordNet project by the Thai Computational Linguistics (TCL) Laboratory in cooperation with Japan’s National Institute of Information and Communications Technology (NICT). Viet WNMS has been customized and improved to accommodate the Vietnamese language. VietWordNet comprises 40,788 synonyms with 67,344 word units, including 40,788 common Vietnamese words. The language data provided by VietWordNet make it possible to extend the triplets.

TESTING AND ASSESSMENT

The proposed algorithm was tested as follows:
- Development of test datasets. Articles from reactionary webpages and from other (non-reactionary) webpages were extracted, analyzed, and tagged as containing reactionary or non-reactionary viewpoints, and the keyphrases expressing reactionary viewpoints (if any) were recorded. This yielded 100 articles containing reactionary viewpoints and 100 articles containing non-reactionary viewpoints. Next, the articles tagged as containing reactionary viewpoints were split into two sets in chronological order: the Training Corpus, consisting of the 50 earlier posts, was used to develop triplets manually; Test Dataset A, consisting of the 50 later posts, was used for testing. The articles containing non-reactionary viewpoints were gathered into Test Dataset B.
- Development of triplets. For the Training Corpus, all keyphrases were extracted and manually converted into triplets. This resulted in 400 triplets: 60 one-element triplets, 110 two-element triplets, and 230 three-element triplets.
- Testing the manually developed triplets. The algorithm was tested with the manually developed triplets on three datasets: the Training Corpus, Test Dataset A, and Test Dataset B. The test results are presented in the corresponding table. According to
the obtained results, the accuracy of the algorithm was 100% for the Training Corpus, 100% for Test Dataset B (since the articles in this set did not contain keyphrases), and only 56% for Test Dataset A, since this set contained keyphrases different from those seen in training. Analysis showed that only 28 of the 50 articles in Test Dataset A contain keyphrases covered by the manually developed triplets.
- Extension of triplets. The triplet extension algorithm using the synonyms in VietWordNet increased the number of triplets to 521,200: 60 one-element triplets, 5,910 two-element triplets, and 515,230 three-element triplets.
- Testing the extended triplets. The algorithm was also tested with the extended triplets on the same datasets as in the previous test. The test results are presented in the corresponding table. The accuracy for Test Dataset A increased significantly, to 78%, because some keyphrases in Test Dataset A that were absent from the original triplets are present in the extended triplets thanks to the synonyms in VietWordNet. This shows that extending the triplets using VietWordNet increases the accuracy of the algorithm: the result for Test Dataset A with the extended triplets is 22 percentage points higher than with the manually developed triplets. Analysis of Test Dataset A showed that 28 articles contain keyphrases covered by the manually developed triplets, 11 articles contain keyphrases covered only by the extended triplets, 10 articles contain new keyphrases, and 1 article contains no keyphrases although it expresses reactionary viewpoints.

The results of both tests show that the accuracy for the “prospective” dataset is still low (less than 80%). This is the shortcoming of the algorithm: practical application requires the triplets to be updated so that they express new and uncommon reactionary viewpoints. However, such new and uncommon reactionary viewpoints rarely appear, accounting for about 22% of the cases, and the
existing viewpoints expressed in corresponding terms prevail, accounting for approximately 78%.

Conclusion

The study considered different solutions to the problem of determining whether articles on webpages contain reactionary viewpoints, as a tool to improve the effectiveness of information dissemination activities. The study proposes the use of triplets, together with an algorithm that extends the triplets with synonyms from VietWordNet, to solve this problem, and implements the algorithm using the graph data structure in Spark GraphX. The algorithm achieves high accuracy on the training corpus and acceptable accuracy on the test dataset with extended triplets. The experiment shows that the proposed algorithm can increase the effectiveness of text classification, since it detects a considerable percentage (more than 70%) of the articles containing reactionary viewpoints by using keyphrases. For more effective application of the solution in practice, the remaining articles (less than 30%), which are tagged as containing non-reactionary viewpoints, must be reanalyzed. If reactionary phrases are found in them, the triplets should be updated accordingly. This study can be developed further by combining it with a text classification method based on statistical machine learning to establish the original triplets, and/or with a grammatical analysis method to check the grammatical relations among the elements of a triplet in a sentence.

References

1. Đào Thanh Quyên (2016), “Đổi mới nội dung, phương thức đấu tranh chống quan điểm sai trái trên mạng internet hiện nay”, http://www.tuyengiao.vn/Home/Bao-ve-nen-tang-tu-tuong-cua-Dang/87873/Doi-moi-noi-dung-phuong-thuc-dau-tranh-chong-quan-diem-sai-trai-tren-manginternet-hien-nay
2. Phuong Le Hong, Huyen Nguyen Thi Minh, Vinh Ngo Tuong (2008), “A Hybrid Approach to Word Segmentation of Vietnamese Texts”, 2nd International Conference on Language and Automata Theory and Applications.
3. L. N. Thi, L. H. My, H. N. Viet, H. N. T. Minh and P. L. Hong, “Building a treebank for
Vietnamese dependency parsing,” in Proceedings of the 10th IEEE RIVF International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future, pp. 147-151.
4. D. Nguyen, D. Nguyen, S. Pham, P.-T. Nguyen and M. L. Nguyen (2014), “From treebank conversion to automatic dependency parsing for Vietnamese,” in Proceedings of 19th International Conference on Application of Natural Language to Information Systems, pp. 196-207.
5. C. Vu-Manh, A. T. Luong and P. Le-Hong (2015), “Improving Vietnamese Dependency Parsing Using Distributed Word Representations,” Proceedings of the Sixth International Symposium on Information and Communication Technology, pp. 54-60.
6. D. Q. Nguyen, M. Dras and M. Johnson (2016), “An empirical study for Vietnamese dependency parsing,” in Proceedings of Australasian Language Technology Association Workshop, pp. 143-149.
7. K. V. Nguyen and N. L.-T. Nguyen (2016), “Vietnamese Dependency Parsing with Supertag Features,” 2016 Seventh International Conference on Knowledge and Systems Engineering (KSE).
8. T. T. T. Trần, C. T. Vũ, N. Tạ (2012), “Xây dựng hệ thống phân loại tài liệu Tiếng Việt”, Khoa Công nghệ Thông tin, Trường ĐH Lạc Hồng, Biên Hòa.
9. Đ. C. Trần, K. N. Phạm (2012), “Phân loại văn với máy học vector hỗ trợ định”, Tạp chí khoa học, Trường Đại học Cần Thơ.
10. T. T. V. Nguyễn (2012), “Nghiên cứu số thuật toán học máy có giám sát ứng dụng lọc thư rác”, Luận văn thạc sỹ kỹ thuật, Học viện Công nghệ Bưu chính Viễn thông, Hà Nội.
11. D. Blei, A. Ng and M. Jordan (2003), “Latent Dirichlet Allocation”, Journal of Machine Learning Research.
12. P. N. Trần, V. T. Phạm, X. C. Phạm, Q. V. D. Nguyễn (2013), “Phân loại nội dung tài liệu web tiếng Việt”, Tạp chí Khoa học Công nghệ 51.
13. Bộ Khoa Học Công Nghệ, http://viet.wordnet.vn (Accessed July 2017).
14. X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, R. Xin, M. Franklin, R. Zadeh, M. Zaharia and A. Talwalkar (2016), “MLlib: Machine Learning in Apache Spark”, Journal of Machine Learning
Research 17.
15. J. Gonzalez, R. Xin, A. Dave, D. Crankshaw, M. Franklin and I. Stoica (2014), “GraphX: Graph Processing in a Distributed Dataflow Framework”.
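
Illustrative sketch. The ordered triplet matching of Section 2.2 and the synonym extension of Section 2.3 can be illustrated on a single machine with the short Python sketch below. It is not the Spark GraphX implementation described in the paper. The function names (sentence_matches, extend_triplet, is_flagged), the plain substring search on lower-cased text, and the toy synonym dictionary are all assumptions made for illustration; the only synonyms used are those listed in Example 7, not actual VietWordNet output.

```python
from itertools import product
from typing import Dict, Iterable, List, Sequence, Tuple

Triplet = Tuple[str, str, str]


def sentence_matches(sentence: str, triplet: Triplet) -> bool:
    """Return True if every non-empty element occurs in the sentence, in triplet order."""
    if not any(triplet):
        return False  # assumption: a triplet with no elements never matches
    text = sentence.lower()
    position = 0
    for element in triplet:
        if not element:
            continue  # empty slots impose no constraint
        needle = element.lower()
        found = text.find(needle, position)
        if found == -1:
            return False
        position = found + len(needle)  # later elements must come after this one
    return True


def extend_triplet(triplet: Triplet, synonyms: Dict[str, Sequence[str]]) -> List[Triplet]:
    """Build every combination obtained by swapping elements for their synonyms."""
    choices = []
    for element in triplet:
        variants = [element] + list(synonyms.get(element, [])) if element else [""]
        choices.append(list(dict.fromkeys(variants)))  # drop duplicates, keep order
    return [tuple(combo) for combo in product(*choices)]


def is_flagged(sentences: Iterable[str], triplets: Iterable[Triplet]) -> bool:
    """A text is flagged as soon as one sentence matches one triplet."""
    triplet_list = list(triplets)
    return any(sentence_matches(s, t) for s in sentences for t in triplet_list)


# Toy synonym table (hypothetical stand-in for VietWordNet, taken from Example 7).
toy_synonyms = {"chính quyền": ["chính phủ", "nhà nước"]}
base = ("chính quyền", "đàn áp", "nhân dân")
extended = extend_triplet(base, toy_synonyms)
```

In this sketch the extension is a Cartesian product over the synonym choices of each element, which is consistent with the rapid growth in the number of three-element triplets reported in the testing section. A sentence that uses a synonym of “chính quyền” is missed by the base triplet alone but caught once the extended list is passed to is_flagged.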
