Research on deep learning methods for Vietnamese news text classification: Master's thesis in Information Technology

MINISTRY OF EDUCATION AND TRAINING
LẠC HỒNG UNIVERSITY

RESEARCH ON DEEP LEARNING METHODS FOR VIETNAMESE NEWS TEXT CLASSIFICATION

Major: Information Technology
Code: 8480201

MASTER'S THESIS IN INFORMATION TECHNOLOGY

ĐỒNG NAI, 2022

DECLARATION

I hereby declare that the results presented in this thesis are my own work, the outcome of my independent study and scientific research. Throughout the thesis, the material presented is either my own or synthesized from multiple sources. All references have clear origins and are cited properly. I take full responsibility for this declaration and accept any disciplinary action prescribed by the regulations.

Đồng Nai, 2022
The author

ACKNOWLEDGMENTS

I sincerely thank the lecturers of Lạc Hồng University, who taught and helped me wholeheartedly and gave me the best possible conditions throughout my time at the university so that I could complete this thesis. I am deeply grateful to my supervisor (PGS.TS, Associate Professor and Doctor), who devotedly provided scientific guidance, help, and advice throughout the research and the writing of this thesis.

Thank you very much!

TABLE OF CONTENTS

DECLARATION
ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF SYMBOLS AND ABBREVIATIONS
1.1 Problem statement
1.2 Reasons for choosing the topic
1.3 Objectives
1.3.1 General objective
1.3.2 Specific objectives
1.4 Research content
1.5 Research methods
1.6 Contributions
1.7 Thesis structure
Chapter - Demonstration program
CHAPTER 2: THEORETICAL BACKGROUND
2.1 Related work
2.1.1 Research [...] the country
2.1.2 Research [...] the country
2.2 Overview of machine learning
2.2.1 What is machine learning?
2.2.2 Traditional machine learning models
2.2.2.1 Support Vector Machine
2.2.2.2 Naïve Bayes
2.2.3 Deep learning models
2.2.3.1 Convolutional deep network: Convolutional Neural Network
2.2.3.2 Recurrent deep network: Long Short-Term Memory
2.2.4 Language models
2.2.5 Text representation methods
CHAPTER 3: BUILDING THE TEXT CLASSIFICATION MODEL
3.1 Problem description
3.2 Experimental models
3.2.1 Traditional machine learning models
3.2.2 TF-IDF representation
3.2.3 Deep learning models
3.2.4 Language models
CHAPTER 4: EXPERIMENTS
4.1 Model setup
4.2 Experimental datasets
4.3 Evaluation metrics
4.4 Experimental results
CHAPTER 5: CONCLUSION AND FUTURE WORK
5.1 Conclusion
5.2 Future work
REFERENCES

LIST OF FIGURES

Figure 1.1 Spam classification in email applications
Figure 2.1 Overview of the relationships between machine learning approaches
Figure 2.2 Example of a maximum-margin hyperplane in two-dimensional space
Figure 3.1 Proposed model for the text classification problem
Figure 3.2 Overview of the experimental process for the CNN and LSTM deep learning models
Figure 3.3 CNN model for the news classification problem
Figure 4.1 Chart comparing the sizes of the two datasets
Figure 4.2 How precision and recall are computed
Figure 4.3 How precision and recall are computed
Figure 4.4 Accuracy and loss of the CNN model over the total number of epochs
Figure 4.5 Confusion matrix of the CNN model on the dataset
Figure 4.6 Confusion matrix of the BERT model on the test set
Figure 4.7 Accuracy and loss values of the BERT model
Figure 4.8 Comparison of the F1 score of each label on the test data for the BERT model
Figure 5.1 Home page of the demo application for Vietnamese news text classification
Figure 5.2 Page where the user enters the content of a news text so that the model can determine its topic
Figure 5.3 Result page after the user presses the predict button: the system shows the predicted topic label for the input text
Figure 5.4 Page where the user submits a .txt data file to predict the topic of the input file
Figure 5.5 Page after the user presses the "upload" button and selects the file example1.txt to predict the topic of the text in that file
Figure 5.6 Prediction result for the input file example1.txt, showing the user the file content and the model's prediction
Figure 5.7 Information page on the training data and test data used in the study
Figure 5.8 Experimental results of the models in the thesis

LIST OF TABLES

Table 4.1 Statistics comparing the sizes of the training set and the test set
Table 4.2 Metric results of the different methods on the test set
Table 4.3 Mapping from label names to their encoded index in the data
Table 4.4 Detailed results for each topic label on the test set for the BERT model

LIST OF SYMBOLS AND ABBREVIATIONS

Symbol    Term
AI        Artificial Intelligence
BERT      Bidirectional Encoder Representations from Transformers
CNN       Convolutional Neural Network
LSTM      Long Short-Term Memory
NB        Naïve Bayes
ReLU      Rectified Linear Unit
SVM       Support Vector Machine

CHAPTER 1: OVERVIEW

1.1 Problem statement

With the strong growth of information technology and online news sites in recent years, people can follow the news quickly through electronic news sites. Many online newspapers now attract large numbers of visits, for example thanhnien, tuoitre, and 24h. As a result, more and more information is published on a wide range of everyday topics such as sports, current affairs, and daily life. Articles on these sites are usually grouped into topics according to their content, and the topic of each article is assigned by an editor who labels it according to its subject.

A prominent application of text classification is the spam filtering system that Google uses to remove spam messages and keep users' inboxes free of noise. This is a typical example of how important text classification systems are in practice. Text classification also supports other natural language processing tasks such as text retrieval and text summarization.

Figure 1.1 Spam classification in email applications

Worldwide, many studies on this problem have achieved promising results, especially for resource-rich languages such as English, Chinese, and French. For Vietnamese, research remains limited and not many studies have been ...
APPENDIX 1: PROGRAM SOURCE CODE

P1.1 Traditional machine learning model

# -*- coding: utf-8 -*-
import glob
import os
import re
import string

from pyvi import ViTokenizer

# Paths to the data
path_train = 'Train_Full/Train_Full'
path_test = 'Test_Full/Test_Full'
path_stopwords = 'vietnamese-stopwords/vietnamese-stopwords.txt'

# Text preprocessing steps
def normalText(sent):
    # A hashtag is '#' followed by letters; [^\W\d_] matches any Unicode
    # letter, which covers Vietnamese letters with diacritics.
    patHagTag = r'#\s?[^\W\d_]+'
    # A URL starts with http://, https:// or www. and runs to the next whitespace.
    patURL = r'(?:https?://|www\.)\S+'
    sent = re.sub(patURL, ' website ', sent)
    sent = re.sub(patHagTag, ' hashtag ', sent)
    sent = re.sub(r'\.+', '.', sent)
    sent = re.sub(r'(hashtag\s+)+', ' hashtag ', sent)
    sent = re.sub(r'\s+', ' ', sent)
    return sent

# Word segmentation
def tokenizer(text):
    return ViTokenizer.tokenize(text)

with open(path_stopwords, "r", encoding="utf8") as file:
    # Multi-word stop words are joined with "_" to match the segmented text
    stopwords_set = {w.strip().replace(" ", "_")
                     for w in file.read().split("\n") if w.strip()}

def remove_stopwords(text):
    # Drop whole tokens only, so a stop word is never cut out of a longer word
    return " ".join(token for token in text.split() if token not in stopwords_set)

def clean_doc(doc):
    # Replace links with "website" and hashtags with "hashtag" while they are still intact
    doc = normalText(doc)
    # Separate punctuation from words before word segmentation
    for punc in string.punctuation:
        doc = doc.replace(punc, ' ' + punc + ' ')
    # Segment the text into words
    doc = tokenizer(doc)
    # Lowercase everything, then remove stop words
    doc = doc.lower()
    doc = remove_stopwords(doc)
    # Replace numbers with the token "num"
    doc = re.sub(r"[0-9]+", " num ", doc)
    # Remove punctuation, keeping "_" because it joins the syllables of a word
    for punc in string.punctuation:
        if punc != "_":
            doc = doc.replace(punc, ' ')
    # Collapse repeated whitespace into a single space
    doc = re.sub(r'\s+', ' ', doc)
    return doc.strip()

# Read the data: one sub-folder per category, at most `threshold` files each
def read_data(folder_path, threshold=1000):
    documents = []
    labels = []
    for category in os.listdir(folder_path):
        count = 0
        print(category)
        path_new = folder_path + "/" + category + "/*.txt"
        for filename in glob.glob(path_new):
            if count < threshold:
                with open(filename, 'r', encoding="utf-16") as file:
                    content = file.read()
                documents.append(content)
                labels.append(category)
                count += 1
            else:
                break
    return documents, labels

# Load the data into the program
X_train, y_train = read_data(path_train)
X_test, y_test = read_data(path_test)
print(len(X_test), len(y_test))

# List of data labels
list_label = ["Chinh tri Xa hoi", "Doi song", "Khoa hoc", "Kinh doanh", "Phap luat",
              "Suc khoe", "The gioi", "The thao", "Van hoa", "Vi tinh"]
print("Number of labels: ", len(list_label))

# Run the preprocessing on both sets and convert each label to its index in
# list_label, e.g. "Chinh tri Xa hoi" becomes 0 and "Khoa hoc" becomes 2.
X_train_proceed = [clean_doc(doc) for doc in X_train]
y_train_encoded = [list_label.index(label) for label in y_train]
X_test_proceed = [clean_doc(doc) for doc in X_test]
y_test_encoded = [list_label.index(label) for label in y_test]

# Represent each text as a TF-IDF vector and feed it to the LinearSVC model for training
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 1))
X_train_tfidf = vectorizer.fit_transform(X_train_proceed)
print("Vocabulary size indexed by TF-IDF: ", len(vectorizer.get_feature_names_out()))
print("Shape of the TF-IDF representation: ", X_train_tfidf.shape)

# Initialize the LinearSVC model and train it on the training set
from sklearn.svm import LinearSVC
classifier = LinearSVC(random_state=42)
classifier.fit(X_train_tfidf, y_train_encoded)
print("Model training finished")

# Convert the test set to TF-IDF vectors and predict with the classifier, then
# evaluate with Accuracy, F1-score, Precision and Recall
from sklearn.metrics import (accuracy_score, classification_report, f1_score,
                             precision_score, recall_score)
X_test_tfidf = vectorizer.transform(X_test_proceed)
predicted = classifier.predict(X_test_tfidf)
y_pred = [int(y) for y in predicted]
print("SVM model results")
print("Accuracy: ", accuracy_score(y_test_encoded, y_pred))
print("F1-score: ", f1_score(y_test_encoded, y_pred, average="weighted"))
print("Precision: ", precision_score(y_test_encoded, y_pred, average="weighted"))
print("Recall: ", recall_score(y_test_encoded, y_pred, average="weighted"))
# Per-label results
print(classification_report(y_test_encoded, y_pred, target_names=list_label))

The data preprocessing and data reading steps are the same as for the machine learning model. Below, the author annotates the code that initializes and sets up the deep learning models.

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Normalize the output labels: one-hot vectors for CategoricalCrossentropy
number_class = len(list_label)
y_train_vector = tf.keras.utils.to_categorical(y_train_encoded, num_classes=number_class)
y_test_vector = tf.keras.utils.to_categorical(y_test_encoded, num_classes=number_class)

# Padded sequence length: the token count of the first training document
xLengths = [len(x.split(' ')) for x in X_train_proceed]
maxLength = xLengths[0]
print(maxLength)

tokenizer = tf.keras.preprocessing.text.Tokenizer(filters="", oov_token='')
tokenizer.fit_on_texts(X_train_proceed)
word_index = tokenizer.word_index
# +1 because index 0 is reserved for padding
input_vocab_size = len(tokenizer.word_index) + 1
print("input_vocab_size:", input_vocab_size)

train_seqs = tokenizer.texts_to_sequences(X_train_proceed)
test_seqs = tokenizer.texts_to_sequences(X_test_proceed)
train_ids = pad_sequences(train_seqs, maxlen=maxLength, dtype="int32", value=0,
                          truncating="post", padding="post")
test_ids = pad_sequences(test_seqs, maxlen=maxLength, dtype="int32", value=0,
                         truncating="post", padding="post")

filter_nums = 128
EMBEDDING_DIM = 300

# Shared embedding front end of the CNN and LSTM models
def embed_inputs(sequence_max_length):
    inputs = tf.keras.layers.Input(shape=(sequence_max_length,), dtype='int32')
    embedding_layer = tf.keras.layers.Embedding(
        input_vocab_size, EMBEDDING_DIM,
        input_length=sequence_max_length, trainable=True)(inputs)
    embedding_layer = tf.keras.layers.SpatialDropout1D(0.5)(embedding_layer)
    return inputs, embedding_layer

# Shared compile and training step: Adam optimizer, early stopping on val_loss
def compile_and_fit(model):
    loss = tf.keras.losses.CategoricalCrossentropy()
    model.compile(loss=loss, optimizer="adam", metrics=["accuracy"])
    model.summary()
    callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
    history = model.fit(train_ids, y_train_vector,
                        validation_data=(test_ids, y_test_vector),
                        batch_size=32, epochs=50, callbacks=[callback])
    return model, history

# Plot accuracy and loss during training
def plot_history(history):
    print(history.history.keys())
    for metric in ('accuracy', 'loss'):
        plt.plot(history.history[metric])
        plt.plot(history.history['val_' + metric])
        plt.title('model ' + metric)
        plt.ylabel(metric)
        plt.xlabel('epoch')
        plt.legend(['train', 'validation'], loc='upper left')
        plt.show()

# Predict on the test set and print precision, recall and F1
def evaluate(model, history, name, inputs, batch_size=32):
    plot_history(history)
    # The predicted class is the index with the highest score
    y_pred = model.predict(inputs, batch_size=batch_size).argmax(axis=-1)
    print(y_pred)
    print(name + " model results")
    print("Precision: ", precision_score(y_test_encoded, y_pred, average="weighted"))
    print("Recall: ", recall_score(y_test_encoded, y_pred, average="weighted"))
    print("F1-score: ", f1_score(y_test_encoded, y_pred, average="weighted"))
    # Per-label results
    print(classification_report(y_test_encoded, y_pred))
    return y_pred

# CNN model
def build_cnn_model(sequence_max_length=maxLength):
    inputs, embedding_layer = embed_inputs(sequence_max_length)
    # Three parallel convolutions with kernel sizes 2, 3 and 4, each max-pooled over time
    pooled = []
    for kernel_size in (2, 3, 4):
        conv = tf.keras.layers.Conv1D(filter_nums, kernel_size, padding="same",
                                      activation="relu",
                                      kernel_initializer='he_normal')(embedding_layer)
        pooled.append(tf.keras.layers.GlobalMaxPooling1D()(conv))
    v0_col = tf.keras.layers.Concatenate(axis=1)(pooled)
    # sigmoid is kept from the original code; softmax is the more usual
    # choice with categorical cross-entropy
    output = tf.keras.layers.Dense(number_class, name="output1", activation='sigmoid')(v0_col)
    return compile_and_fit(tf.keras.Model(inputs=inputs, outputs=output))

model, history = build_cnn_model()
evaluate(model, history, "CNN", test_ids)

# LSTM model
def build_lstm_model(sequence_max_length=maxLength):
    inputs, embedding_layer = embed_inputs(sequence_max_length)
    v0_col = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(embedding_layer)
    output = tf.keras.layers.Dense(number_class, name="output1", activation='sigmoid')(v0_col)
    return compile_and_fit(tf.keras.Model(inputs=inputs, outputs=output))

model, history = build_lstm_model()
evaluate(model, history, "LSTM", test_ids)

P1.2 BERT model

from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from transformers import AutoTokenizer, TFAutoModel

# Use the PhoBERT tokenizer to build the model inputs
model_name = "vinai/phobert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
train_encodings = tokenizer(X_train_proceed, truncation=True, padding=True)
print(len(train_encodings["input_ids"]))
print(len(train_encodings["attention_mask"]))
max_length = len(train_encodings["input_ids"][0])
print(max_length)

# Convert the values to NumPy arrays
train_ids = np.array(train_encodings["input_ids"])
train_masks = np.array(train_encodings["attention_mask"])
test_encodings = tokenizer(X_test_proceed, max_length=max_length,
                           truncation=True, padding="max_length")
test_ids = np.array(test_encodings["input_ids"])
test_masks = np.array(test_encodings["attention_mask"])

# Load the pretrained PhoBERT model
transformer_model = TFAutoModel.from_pretrained(model_name)

# Build and fine-tune the classifier on top of PhoBERT
def build_model(sequence_max_length=max_length):
    input_ids_in = tf.keras.layers.Input(shape=(sequence_max_length,),
                                         name='input_ids', dtype='int32')
    input_masks_in = tf.keras.layers.Input(shape=(sequence_max_length,),
                                           name='input_mask', dtype='int32')
    outputs = transformer_model(input_ids_in, attention_mask=input_masks_in)[0]
    # Use the hidden state of the first token as the document representation
    cls_token = outputs[:, 0, :]
    output = tf.keras.layers.Dense(number_class, name="output1", activation='softmax')(cls_token)
    model = tf.keras.Model(inputs=[input_ids_in, input_masks_in], outputs=output)
    opt = tfa.optimizers.RectifiedAdam(learning_rate=5e-5, total_steps=10000,
                                       warmup_proportion=0.1, min_lr=1e-5)
    loss = tf.keras.losses.CategoricalCrossentropy()
    model.compile(loss=loss, optimizer=opt, metrics=["accuracy"])
    # Keep only the weights with the best validation accuracy
    checkpoint_filepath = '/tmp/checkpoint'
    model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
        filepath=checkpoint_filepath, save_weights_only=True,
        monitor='val_accuracy', mode='max', save_best_only=True)
    history = model.fit([train_ids, train_masks], y_train_vector,
                        validation_data=([test_ids, test_masks], y_test_vector),
                        batch_size=16, epochs=5, callbacks=[model_checkpoint_callback])
    model.load_weights(checkpoint_filepath)
    return model, history

model, history = build_model()

# Plot the training curves, predict on the test set and print the metrics
y_pred = evaluate(model, history, "PhoBERT", [test_ids, test_masks], batch_size=16)
# Show the true labels next to the predictions printed above
print(y_test)

# Show the confusion matrix
cm = confusion_matrix(y_test_encoded, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

APPENDIX 2: DEMONSTRATION PROGRAM

PL2.1 Programming language

To build a demonstration program for the news topic classification problem, the thesis uses the Flask framework and the Python language, combined with HTML and CSS, to build the demo as a website. The reasons for these choices are as follows:
- Python programming language: Python has a simple, clear syntax. It is much easier to read and write than other programming languages such as C++, Java, or C#. Python makes programming enjoyable and lets you focus on the solution rather than the syntax.
- Machine learning libraries: This work uses two machine learning libraries, sklearn and tensorflow keras. Both are popular Python libraries that provide ready-made implementations of complex algorithms used in data analysis, including many learning algorithms for regression, clustering, and classification.

The website provides the following pages:
+ A page where the user enters a passage of text directly
+ A page where the user uploads a .txt file for the demo
+ A page showing information about the research data used in the thesis
+ A page showing the experimental results of the models used in the thesis

PL2.2 Detailed description of the functions of the site
The following describes in detail the functions of the pages related to the thesis.

Figure 5.1 Home page of the demo application for Vietnamese news text classification

The home page of the application shows information about the thesis: the university, the supervisor, and the student who carried out the work.

Figure 5.2 Page where the user enters the content of a news text so that the model can determine its topic

This page contains a box in which the user can enter a passage of text.

Figure 5.3 Result page after the user presses the predict button: the system shows the predicted topic label for the input text

The result page shows that the model predicted the topic "Thể thao" (Sports) for the input text. Next is the file upload page.

Figure 5.4 Page where the user submits a .txt data file to predict the topic of the input file

Figure 5.5 Page after the user presses the "upload" button and selects the file example1.txt to predict the topic of the text in that file

Figure 5.6 Prediction result for the input file example1.txt, showing the user the file content and the model's prediction

Figure 5.7 Information page on the training data and test data used in the study

This page shows the amount of data for each topic label in the two datasets.

Figure 5.8 Experimental results of the models in the thesis
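The two input modes described in this appendix, typed text and an uploaded .txt file, both reduce to the same step: obtain a string, clean it, and map the predicted index to a topic name. The sketch below illustrates that step independently of Flask, under several assumptions that are not taken from the thesis: `predict_topic`, `predict_topic_from_upload` and `toy_predict_index` are hypothetical names, the keyword rule is only a stand-in for the trained model, whitespace normalization stands in for `clean_doc`, and uploads are assumed to be UTF-8. The label list is the `list_label` of Appendix 1.

```python
from typing import Callable, Sequence, Tuple

# Topic names in the same order as list_label in Appendix 1.
LIST_LABEL = ["Chinh tri Xa hoi", "Doi song", "Khoa hoc", "Kinh doanh", "Phap luat",
              "Suc khoe", "The gioi", "The thao", "Van hoa", "Vi tinh"]


def predict_topic(text: str,
                  predict_index: Callable[[str], int],
                  labels: Sequence[str] = LIST_LABEL) -> str:
    """Return the topic name for text typed into the input box."""
    # Assumption: whitespace normalization stands in for clean_doc().
    cleaned = " ".join(text.split())
    if not cleaned:
        raise ValueError("input text is empty")
    return labels[predict_index(cleaned)]


def predict_topic_from_upload(raw: bytes,
                              predict_index: Callable[[str], int],
                              encoding: str = "utf-8") -> Tuple[str, str]:
    """Decode an uploaded .txt file and return (file content, topic name)."""
    content = raw.decode(encoding)
    return content, predict_topic(content, predict_index)


def toy_predict_index(text: str) -> int:
    """Hypothetical stand-in classifier: a keyword rule, not the thesis model."""
    sports_words = ("bóng đá", "cầu thủ")
    if any(word in text.lower() for word in sports_words):
        return LIST_LABEL.index("The thao")
    return LIST_LABEL.index("Doi song")


if __name__ == "__main__":
    print(predict_topic("Đội tuyển bóng đá thắng trận", toy_predict_index))
    content, topic = predict_topic_from_upload(
        "Giá vàng hôm nay".encode("utf-8"), toy_predict_index)
    print(content, "->", topic)
```

In a Flask view, the typed text would normally come from `request.form` and the uploaded bytes from `request.files`; keeping the prediction logic in plain functions like these makes it possible to test it without starting the web server.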
