Evaluating a Dataset for Autism Spectrum Disorder Classification Using the SVM and Random Forest Algorithms

MINISTRY OF EDUCATION AND TRAINING
NHA TRANG UNIVERSITY

PHẠM QUANG THUẬN

EVALUATING A DATASET FOR AUTISM SPECTRUM DISORDER CLASSIFICATION USING THE SVM AND RANDOM FOREST ALGORITHMS

MASTER'S THESIS

Major: Information Technology
Code: 8480201
Topic assignment decision: 453/QĐ-ĐHNT, dated 04/5/2019
Committee establishment decision: 1523/QĐ-ĐHNT, dated 27/11/2019
Defense date: 23/12/2019
Scientific supervisor: Assoc. Prof. Dr. NGUYỄN ĐÌNH THUÂN
Committee chair: Dr. NGUYỄN THIÊN CHƯƠNG
Graduate Training Office:

KHÁNH HÒA - 2019

DECLARATION

I declare that the results of the thesis "Evaluating a Dataset for Autism Spectrum Disorder Classification Using the SVM and Random Forest Algorithms" are my own research and have not been published in any other scientific work to date.

Khánh Hòa, 15 October 2019
Author of the thesis
Phạm Quang Thuận

ACKNOWLEDGMENTS

To complete this master's thesis, in addition to my own efforts, I received dedicated guidance from my teachers and the encouragement and support of my family and friends throughout my studies and research.

With sincere gratitude, I thank the Board of Rectors, the Graduate Training Office, the Faculty of Information Technology, and the lecturers of Nha Trang University who managed, taught, and helped me throughout my studies and research.

I sincerely thank the Board of Rectors and the offices, departments, faculties, and centers of the Central Pedagogical College in Nha Trang (Cao đẳng Sư phạm Trung ương - Nha Trang) for their attention and help and for creating favorable conditions for me to complete the program.

I am especially grateful to Assoc. Prof. Dr. Nguyễn Đình Thuân, who supervised me directly and helped me with knowledge, materials, and research methods so that I could complete this research.

I thank PhD candidate Vũ Duy Chinh, MSc, Nguyễn Văn Chí, MSc, and Vũ Thị Thúy, MSc, of the Central Pedagogical College in Nha Trang for sharing their knowledge and providing data on children with ASD for this research.

Finally, I sincerely thank my family and all my friends who helped and encouraged me throughout my studies and the work on this thesis.

Despite my best efforts, the thesis still has limitations and shortcomings. I look forward to comments and guidance from teachers and colleagues.

Thank you sincerely!

Khánh Hòa, 15 October 2019
Author of the thesis
Phạm Quang Thuận

TABLE OF CONTENTS

DECLARATION
ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES
THESIS ABSTRACT
Chapter 1. OVERVIEW
1.1 Reasons for choosing the topic
1.2 Research objectives
1.3 Research subject
1.4 Research scope
1.5 Research methods
1.6 Thesis structure
Chapter 2. THEORETICAL AND PRACTICAL BASIS
2.1 ASD and ASD screening methods
2.1.1 ASD
2.1.2 ASD screening methods
2.2 Applying machine learning to ASD classification
2.2.1 Overview of ML
2.2.2 Evaluating machine learning models
2.2.3 Applications of machine learning in ASD classification
2.2.4 Machine learning model for ASD classification
Chapter 3. EVALUATING A DATASET FOR AUTISM SPECTRUM DISORDER CLASSIFICATION USING THE SVM AND RANDOM FOREST ALGORITHMS
3.1 Introduction to ASD datasets
3.2 The ASD dataset
3.2.1 Training dataset
3.2.2 Validation dataset
3.2.3 Data cleaning
3.3 Feature selection
3.4 Model building
3.4.1 Machine learning algorithms
3.4.2 Model building results
3.5 Using the SVM model to classify ASD on real-world data
3.6 ASD screening program
Chapter 4. CONCLUSIONS AND RECOMMENDATIONS
4.1 Conclusions
4.2 Recommendations
REFERENCES
APPENDICES

LIST OF ABBREVIATIONS

ASD : Autism Spectrum Disorder
ML : Machine Learning
CHAT : Checklist for Autism in Toddlers
M-CHAT 23 : Modified Checklist for Autism in Toddlers
CARS : Childhood Autism Rating Scale
ADI-R : Autism Diagnostic Interview - Revised
ADOS : Autism Diagnostic Observation Schedule
GARS : Gilliam Autism Rating Scale
AQ : Autism Spectrum Quotient
Q-CHAT : Quantitative Checklist for Autism in Toddlers
UCI : UCI Machine Learning Repository (University of California, Irvine)
SVM : Support Vector Machine
AGRE : Autism Genetic Resource Exchange (autism genetics dataset)
AC : Boston Autism Consortium (autism dataset of the Boston Autism Consortium)
TP : True positive (number of instances that belong to the positive class and are assigned to the positive class)
TN : True negative (number of instances that belong to the negative class and are assigned to the negative class)
FP : False positive (number of instances that do not belong to the positive class but are assigned to the positive class)
FN : False negative (number of instances that belong to the positive class but are assigned to the negative class)

LIST OF TABLES

Table 2.1 Studies applying machine learning to ASD classification
Table 3.1 Information on the Autistic Spectrum Disorder Screening Data for Children Data Set
Table 3.2 Description of the dataset attributes
Table 3.3 AQ-10 Child ASD Test screening questions
Table 3.4 Expert assessment data
Table 3.5 One-hot encoding technique
Table 3.6 Classes implementing the ML algorithms in the Python library packages
Table 3.7 Results of testing the ML algorithms on the dataset with all features
Table 3.8 Results of testing the ML algorithms on the dataset with 10 features

LIST OF FIGURES

Figure 2.1 General model of an ML problem
Figure 2.2 Confusion matrix
Figure 2.3 ASD classification model using machine learning
Figure 3.1 Interface of the ASD Tests application
Figure 3.2 Navigation diagram of the ASD Tests application
Figure 3.3 AQ-10 questionnaire used for ASD screening in children
Figure 3.4 Image of 20 samples from the Autistic Spectrum Disorder Screening Data for Children Data Set
Figure 3.5 Results of processing the experts' real-world data on children with ASD
Figure 3.6 Example of decision-making based on questions
Figure 3.7 SVM model
Figure 3.8 MLP model
Figure 3.9 Process for building and evaluating the ASD dataset
Figure 3.10 Chart comparing the accuracy of the algorithms on the dataset with all features
Figure 3.11 Chart comparing the accuracy of the algorithms on the dataset with 10 features
Figure 3.12 Classification results on the data using the model trained with the SVM algorithm
Figure 3.13 Interface showing the screening result when the child does not have ASD
Figure 3.14 Interface showing the screening result when the child has ASD

THESIS ABSTRACT

This study evaluates a dataset for classifying autism spectrum disorder (ASD) using the SVM and Random Forest algorithms. Its aim is to extract the necessary features from an ASD screening dataset and to evaluate how well the SVM and Random Forest algorithms classify ASD.

The study builds its machine learning models on the childhood ASD screening dataset published on the UCI repository in December 2017 by Professor Fadi Fayez Thabtah (University of Auckland, New Zealand). The dataset is based on the AQ-10 ASD screening tool and was collected through the ASD Tests application. The study applies Chi-Square feature selection to this dataset, which yields the 10 most important features affecting ASD classification. After feature selection, the study applies the SVM and Random Forest algorithms and also examines Decision Trees, Logistic Regression, K-Nearest Neighbors, Naive Bayes classification, and the Multi-Layer Perceptron. In the experiments, the algorithms achieve high ASD classification results, consistent with the earlier studies surveyed. SVM in particular gives the best ASD classification results.

The study also built a real-world test set of 10 children with clinically diagnosed ASD, with support from Special Education experts at the Central Pedagogical College in Nha Trang (Cao đẳng Sư phạm Trung ương - Nha Trang). The experts used the AQ-10 child questionnaire in the ASD Tests application to assess each case. When this test set was run through the SVM model, all cases were classified correctly (100%).

From these results, the author concludes that the childhood ASD screening dataset can be used to build a reliable ASD classification model, and that SVM gives the best ASD classification results on this dataset. The author built a program that supports ASD screening in children based on the SVM model. The author proposes several directions for further work: continuing to build ASD screening datasets and developing applications that support ASD screening for Vietnamese children.

Keywords: autism spectrum disorder, autism spectrum disorder screening, machine learning algorithms

APPENDICES

Appendix 1. Functions for building models on the 10-feature ASD dataset

# Module-level imports that the functions below rely on.
# DocDuLieu() (loads the train/test split) is defined elsewhere in the
# thesis code and is not part of this listing.
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import fbeta_score
from sklearn.model_selection import KFold, cross_val_score


# 1.1 Decision tree
def DecisionTree():
    from sklearn.tree import DecisionTreeClassifier

    X_train, X_test, y_train, y_test = DocDuLieu()
    dectree = DecisionTreeClassifier(random_state=1)
    dectree.fit(X_train, y_train)
    # Save the model
    f_dectree_model = 'dectree_Model_model.sav'
    with open(f_dectree_model, 'wb') as f:
        pickle.dump(dectree, f)

    print('Model evaluation results for the DecisionTree algorithm.')
    np.random.seed(1234)
    num_kfolds = 10
    # random_state is left out: it has no effect unless shuffle=True, and
    # recent scikit-learn versions reject it when shuffle is False.
    kfold = KFold(n_splits=num_kfolds)
    # Cross-validation scores
    cv_scores = cross_val_score(dectree, X_train, y_train, cv=kfold)
    print('cross_val_score: ', cv_scores.mean())
    print('Cross-validated AUC: ',
          cross_val_score(dectree, X_train, y_train, cv=kfold, scoring='roc_auc').mean())
    predictions_test = dectree.predict(X_test)
    print('F-beta Score: ',
          fbeta_score(y_test, predictions_test, average='binary', beta=0.5))
    # report = classification_report(y_test, predictions_test)
    # print(report)
    # Confusion matrix, indexed as [row, column]
    confusion = metrics.confusion_matrix(y_test, predictions_test)
    print(confusion)
    TP = confusion[1, 1]
    TN = confusion[0, 0]
    FP = confusion[0, 1]
    FN = confusion[1, 0]
    Accuracy = (TP + TN) / float(TP + TN + FP + FN)
    print('Accuracy: ', Accuracy)
    classification_error = (FP + FN) / float(TP + TN + FP + FN)
    print('Classification_error: ', classification_error)
    sensitivity = TP / float(FN + TP)
    print('Sensitivity: ', sensitivity)
    print('Recall: ', metrics.recall_score(y_test, predictions_test))
    specificity = TN / (TN + FP)
    print('Specificity: ', specificity)
    false_positive_rate = FP / float(TN + FP)
    print('False_positive_rate: ', false_positive_rate)
    precision = TP / float(TP + FP)
    print('Precision: ', precision)
    # ROC curve (computed from hard class predictions, so it has a single operating point)
    fpr, tpr, thresholds = metrics.roc_curve(y_test, predictions_test)
    plt.plot(fpr, tpr)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.rcParams['font.size'] = 12
    plt.title('ROC curve for ASD classifier Decision Trees')
    plt.xlabel('False Positive Rate (1 - Specificity)')
    plt.ylabel('True Positive Rate (Sensitivity)')
    plt.grid(True)
    plt.show()


# 1.2 Random forest
def RandomForest():
    from sklearn.ensemble import RandomForestClassifier

    X_train, X_test, y_train, y_test = DocDuLieu()
    ranfor = RandomForestClassifier(n_estimators=5, random_state=1)
    ranfor.fit(X_train, y_train)
    # Save the model
    f_ranfor_model = 'ranfor_Model_model.sav'
    with open(f_ranfor_model, 'wb') as f:
        pickle.dump(ranfor, f)

    print('Model evaluation results for the RandomForest algorithm.')
    np.random.seed(1234)
    num_kfolds = 10
    # random_state is left out: it has no effect unless shuffle=True
    kfold = KFold(n_splits=num_kfolds)
    # Cross-validation scores
    cv_scores = cross_val_score(ranfor, X_train, y_train, cv=kfold)
    print('cross_val_score: ', cv_scores.mean())
    print('Cross-validated AUC: ',
          cross_val_score(ranfor, X_train, y_train, cv=kfold, scoring='roc_auc').mean())
    predictions_test = ranfor.predict(X_test)
    print('F-beta Score: ',
          fbeta_score(y_test, predictions_test, average='binary', beta=0.5))
    # report = classification_report(y_test, predictions_test)
    # print(report)
    # Confusion matrix, indexed as [row, column]
    confusion = metrics.confusion_matrix(y_test, predictions_test)
    print(confusion)
    TP = confusion[1, 1]
    TN = confusion[0, 0]
    FP = confusion[0, 1]
    FN = confusion[1, 0]
    Accuracy = (TP + TN) / float(TP + TN + FP + FN)
    print('Accuracy: ', Accuracy)
    classification_error = (FP + FN) / float(TP + TN + FP + FN)
    print('Classification_error: ', classification_error)
    sensitivity = TP / float(FN + TP)
    print('Sensitivity: ', sensitivity)
    print('Recall: ', metrics.recall_score(y_test, predictions_test))
    specificity = TN / (TN + FP)
    print('Specificity: ', specificity)
    false_positive_rate = FP / float(TN + FP)
    print('False_positive_rate: ', false_positive_rate)
    precision = TP / float(TP + FP)
    print('Precision: ', precision)
    # ROC curve
    fpr, tpr, thresholds = metrics.roc_curve(y_test, predictions_test)
    plt.plot(fpr, tpr)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.rcParams['font.size'] = 12
    plt.title('ROC curve for ASD classifier RandomForest')
    plt.xlabel('False Positive Rate (1 - Specificity)')
    plt.ylabel('True Positive Rate (Sensitivity)')
    plt.grid(True)
    plt.show()


# 1.3 SVM
def SVM():
    from sklearn import svm

    X_train, X_test, y_train, y_test = DocDuLieu()
    C = 1.0
    # gamma is not used by the linear kernel, so it is not passed
    svc = svm.SVC(kernel='linear', C=C)
    svc.fit(X_train, y_train)
    # Save the model
    f_SVM_Model = 'SVM_model.sav'
    with open(f_SVM_Model, 'wb') as f:
        pickle.dump(svc, f)

    print('Model evaluation results for the SVM algorithm.')
    np.random.seed(1234)
    num_kfolds = 10
    kfold = KFold(n_splits=num_kfolds)
    # Cross-validation scores
    cv_scores = cross_val_score(svc, X_train, y_train, cv=kfold)
    print('cross_val_score: ', cv_scores.mean())
    print('Cross-validated AUC: ',
          cross_val_score(svc, X_train, y_train, cv=kfold, scoring='roc_auc').mean())
    predictions_test = svc.predict(X_test)
    print('F-beta Score: ',
          fbeta_score(y_test, predictions_test, average='binary', beta=0.5))
    # report = classification_report(y_test, predictions_test)
    # print(report)
    # Confusion matrix, indexed as [row, column]
    confusion = metrics.confusion_matrix(y_test, predictions_test)
    print(confusion)
    TP = confusion[1, 1]
    TN = confusion[0, 0]
    FP = confusion[0, 1]
    FN = confusion[1, 0]
    Accuracy = (TP + TN) / float(TP + TN + FP + FN)
    print('Accuracy: ', Accuracy)
    classification_error = (FP + FN) / float(TP + TN + FP + FN)
    print('Classification_error: ', classification_error)
    sensitivity = TP / float(FN + TP)
    print('Sensitivity: ', sensitivity)
    print('Recall: ', metrics.recall_score(y_test, predictions_test))
    specificity = TN / (TN + FP)
    print('Specificity: ', specificity)
    false_positive_rate = FP / float(TN + FP)
    print('False_positive_rate: ', false_positive_rate)
    precision = TP / float(TP + FP)
    print('Precision: ', precision)
    # ROC curve
    fpr, tpr, thresholds = metrics.roc_curve(y_test, predictions_test)
    plt.plot(fpr, tpr)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.rcParams['font.size'] = 12
    plt.title('ROC curve for ASD classifier SVM')
    plt.xlabel('False Positive Rate (1 - Specificity)')
    plt.ylabel('True Positive Rate (Sensitivity)')
    plt.grid(True)
    plt.show()


# 1.4 Logistic regression
def LogisticRegression():
    # Inside this function the name refers to the scikit-learn class
    from sklearn.linear_model import LogisticRegression

    X_train, X_test, y_train, y_test = DocDuLieu()
    logreg = LogisticRegression()
    logreg.fit(X_train, y_train)
    # Save the model
    f_logreg_model = 'logreg_Model_model.sav'
    with open(f_logreg_model, 'wb') as f:
        pickle.dump(logreg, f)

    print('Model evaluation results for the Logistic Regression algorithm.')
    np.random.seed(1234)
    num_kfolds = 10
    kfold = KFold(n_splits=num_kfolds)
    # Cross-validation scores
    cv_scores = cross_val_score(logreg, X_train, y_train, cv=kfold)
    print('cross_val_score: ', cv_scores.mean())
    print('Cross-validated AUC: ',
          cross_val_score(logreg, X_train, y_train, cv=kfold, scoring='roc_auc').mean())
    predictions_test = logreg.predict(X_test)
    print('F-beta Score: ',
          fbeta_score(y_test, predictions_test, average='binary', beta=0.5))
    # report = classification_report(y_test, predictions_test)
    # print(report)
    # Confusion matrix, indexed as [row, column]
    confusion = metrics.confusion_matrix(y_test, predictions_test)
    print(confusion)
    TP = confusion[1, 1]
    TN = confusion[0, 0]
    FP = confusion[0, 1]
    FN = confusion[1, 0]
    Accuracy = (TP + TN) / float(TP + TN + FP + FN)
    print('Accuracy: ', Accuracy)
    classification_error = (FP + FN) / float(TP + TN + FP + FN)
    print('Classification_error: ', classification_error)
    sensitivity = TP / float(FN + TP)
    print('Sensitivity: ', sensitivity)
    print('Recall: ', metrics.recall_score(y_test, predictions_test))
    specificity = TN / (TN + FP)
    print('Specificity: ', specificity)
    false_positive_rate = FP / float(TN + FP)
    print('False_positive_rate: ', false_positive_rate)
    precision = TP / float(TP + FP)
    print('Precision: ', precision)
    # ROC curve
    fpr, tpr, thresholds = metrics.roc_curve(y_test, predictions_test)
    plt.plot(fpr, tpr)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.rcParams['font.size'] = 12
    plt.title('ROC curve for ASD classifier Logistic Regression')
    plt.xlabel('False Positive Rate (1 - Specificity)')
    plt.ylabel('True Positive Rate (Sensitivity)')
    plt.grid(True)
    plt.show()


# 1.5 K-Nearest Neighbors (KNN)
def KNN():
    from sklearn import neighbors
    from sklearn.metrics import classification_report

    X_train, X_test, y_train, y_test = DocDuLieu()
    knn = neighbors.KNeighborsClassifier(n_neighbors=10)
    knn.fit(X_train, y_train)
    # Save the model
    f_knn_Model = 'knn_model.sav'
    with open(f_knn_Model, 'wb') as f:
        pickle.dump(knn, f)

    print('Model evaluation results for the KNN algorithm.')
    np.random.seed(1234)
    num_kfolds = 10
    kfold = KFold(n_splits=num_kfolds)
    # Cross-validation scores
    cv_scores = cross_val_score(knn, X_train, y_train, cv=kfold)
    print('cross_val_score: ', cv_scores.mean())
    print('Cross-validated AUC: ',
          cross_val_score(knn, X_train, y_train, cv=kfold, scoring='roc_auc').mean())
    predictions_test = knn.predict(X_test)
    print('F-beta Score: ',
          fbeta_score(y_test, predictions_test, average='binary', beta=0.5))
    # report = classification_report(y_test, predictions_test)
    # print(report)
    # Confusion matrix, indexed as [row, column]
    confusion = metrics.confusion_matrix(y_test, predictions_test)
    print(confusion)
    TP = confusion[1, 1]
    TN = confusion[0, 0]
    FP = confusion[0, 1]
    FN = confusion[1, 0]
    Accuracy = (TP + TN) / float(TP + TN + FP + FN)
    print('Accuracy: ', Accuracy)
    classification_error = (FP + FN) / float(TP + TN + FP + FN)
    print('Classification_error: ', classification_error)
    sensitivity = TP / float(FN + TP)
    print('Sensitivity: ', sensitivity)
    print('Recall: ', metrics.recall_score(y_test, predictions_test))
    specificity = TN / (TN + FP)
    print('Specificity: ', specificity)
    false_positive_rate = FP / float(TN + FP)
    print('False_positive_rate: ', false_positive_rate)
    precision = TP / float(TP + FP)
    print('Precision: ', precision)
    # ROC curve
    fpr, tpr, thresholds = metrics.roc_curve(y_test, predictions_test)
    plt.plot(fpr, tpr)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.rcParams['font.size'] = 12
    plt.title('ROC curve for ASD classifier K-NearestNeighbors')
    plt.xlabel('False Positive Rate (1 - Specificity)')
    plt.ylabel('True Positive Rate (Sensitivity)')
    plt.grid(True)
    plt.show()


# 1.6 Naive Bayes
def NaiveBayes():
    from sklearn.naive_bayes import MultinomialNB

    X_train, X_test, y_train, y_test = DocDuLieu()
    nb = MultinomialNB()
    nb.fit(X_train, y_train)
    # Save the model under its own file name so the KNN model is not overwritten
    f_nb_Model = 'nb_model.sav'
    with open(f_nb_Model, 'wb') as f:
        pickle.dump(nb, f)

    print('Model evaluation results for the NaiveBayes algorithm.')
    np.random.seed(1234)
    num_kfolds = 10
    kfold = KFold(n_splits=num_kfolds)
    # Cross-validation scores
    cv_scores = cross_val_score(nb, X_train, y_train, cv=kfold)
    print('cross_val_score: ', cv_scores.mean())
    print('Cross-validated AUC: ',
          cross_val_score(nb, X_train, y_train, cv=kfold, scoring='roc_auc').mean())
    predictions_test = nb.predict(X_test)
    print('F-beta Score: ',
          fbeta_score(y_test, predictions_test, average='binary', beta=0.5))
    # report = classification_report(y_test, predictions_test)
    # print(report)
    # Confusion matrix, indexed as [row, column]
    confusion = metrics.confusion_matrix(y_test, predictions_test)
    print(confusion)
    TP = confusion[1, 1]
    TN = confusion[0, 0]
    FP = confusion[0, 1]
    FN = confusion[1, 0]
    Accuracy = (TP + TN) / float(TP + TN + FP + FN)
    print('Accuracy: ', Accuracy)
    classification_error = (FP + FN) / float(TP + TN + FP + FN)
    print('Classification_error: ', classification_error)
    sensitivity = TP / float(FN + TP)
    print('Sensitivity: ', sensitivity)
    print('Recall: ', metrics.recall_score(y_test, predictions_test))
    specificity = TN / (TN + FP)
    print('Specificity: ', specificity)
    false_positive_rate = FP / float(TN + FP)
    print('False_positive_rate: ', false_positive_rate)
    precision = TP / float(TP + FP)
    print('Precision: ', precision)
    # ROC curve
    fpr, tpr, thresholds = metrics.roc_curve(y_test, predictions_test)
    plt.plot(fpr, tpr)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.rcParams['font.size'] = 12
    plt.title('ROC curve for ASD classifier NaiveBayes')
    plt.xlabel('False Positive Rate (1 - Specificity)')
    plt.ylabel('True Positive Rate (Sensitivity)')
    plt.grid(True)
    plt.show()


# 1.7 MLP
def MLP():
    from sklearn.model_selection import StratifiedKFold
    from keras.models import Sequential
    from keras.layers import Dense
    from sklearn.metrics import accuracy_score, confusion_matrix
    from sklearn.metrics import roc_auc_score, f1_score

    X_train, X_test, y_train, y_test = DocDuLieu()
    # Plain arrays, so the fold indices below can select rows
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    seed = 7
    np.random.seed(seed)
    cvscores = []
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    for train, test in kfold.split(X_train, y_train):
        model = Sequential()
        # First hidden layer, 12 neurons
        model.add(Dense(12, input_dim=10, kernel_initializer='random_uniform', activation='relu'))
        # Second hidden layer, 8 neurons
        model.add(Dense(8, kernel_initializer='random_uniform', activation='relu'))
        # Output neuron
        model.add(Dense(1, kernel_initializer='random_uniform', activation='sigmoid'))
        # Compile the model
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        # Train on the fold's training rows and score on its held-out rows
        model.fit(X_train[train], y_train[train], epochs=150, batch_size=10)
        score = model.evaluate(X_train[test], y_train[test])
        print('%s: %.2f%%' % (model.metrics_names[1], score[1] * 100))
        cvscores.append(score[1] * 100)
    print('cvscores: %.2f%% (+/- %.2f%%)' % (np.mean(cvscores), np.std(cvscores)))

    # Predicted probabilities and crisp classes for the test set (last trained model)
    ASD_predictions = model.predict(X_test, verbose=0).ravel()
    ASD_classes = (ASD_predictions > 0.5).astype('int32')
    accuracy = accuracy_score(y_test, ASD_classes)
    print('Accuracy: %f' % accuracy)
    # Confusion matrix
    confusion_MLP = confusion_matrix(y_test, ASD_classes)
    print(confusion_MLP)
    TP = confusion_MLP[1, 1]
    TN = confusion_MLP[0, 0]
    FP = confusion_MLP[0, 1]
    FN = confusion_MLP[1, 0]
    print('Accuracy: ', (TP + TN) / float(TP + TN + FP + FN))
    classification_error = (FP + FN) / float(TP + TN + FP + FN)
    print('Classification_error: ', classification_error)
    sensitivity = TP / float(FN + TP)  # recall
    print('Sensitivity: ', sensitivity)
    f1 = f1_score(y_test, ASD_classes)
    print('F1 score: %f' % f1)
    specificity = TN / (TN + FP)
    print('Specificity: ', specificity)
    false_positive_rate = FP / float(TN + FP)
    print('False_positive_rate: ', false_positive_rate)
    precision = TP / float(TP + FP)
    print('Precision: ', precision)
    # ROC AUC
    auc = roc_auc_score(y_test, ASD_predictions)
    print('ROC AUC: %f' % auc)
    fpr, tpr, thresholds = metrics.roc_curve(y_test, ASD_predictions)
    plt.plot(fpr, tpr)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.rcParams['font.size'] = 12
    plt.title('ROC curve for ASD classifier MLP')
    plt.xlabel('False Positive Rate (1 - Specificity)')
    plt.ylabel('True Positive Rate (Sensitivity)')
    plt.grid(True)
    plt.show()

Appendix 2. Function for testing the model on real-world data

def Prediction():
    datatest = pd.read_csv('Data-Child-VietNam.csv')
    array = datatest.values
    X_test = array[:, 0:10]
    # Y_test = array[:, 10]
    # Load the trained model
    f_SVM_Model = 'SVM_model.sav'
    with open(f_SVM_Model, 'rb') as f:
        loaded_model = pickle.load(f)
    y_new = loaded_model.predict(X_test)
    # Show the classification results on the new data
    print('Prediction results on the new data.')
    for x, y in zip(X_test, y_new):
        print("X=%s, Predicted=%s" % (x, y))

Appendix 3. Function of the childhood ASD screening program

def SangLocASD():
    import numpy
    import pickle  # save and load the model

    print('# -#')
    print('AUTISM SPECTRUM DISORDER SCREENING PROGRAM USING MACHINE LEARNING.')
    print('Note: this program applies to children aged 4 to 11.')
    print('# -#')
    # NhapDacTrung() (reads the 10 features) is defined elsewhere in the thesis code
    data = NhapDacTrung()
    print('Features of the child being screened:', data)
    data = numpy.array(data)
    x = data.reshape(1, 10)
    # Load the trained model
    f_SVM_Model = 'SVM_model.sav'
    with open(f_SVM_Model, 'rb') as f:
        loaded_model = pickle.load(f)
    y_new = loaded_model.predict(x)
    output = y_new[0]
    if output == 1:
        print('SCREENING RESULT: The child should be taken to a specialist for a clinical ASD diagnosis.')
    else:
        print('SCREENING RESULT: The child shows no signs of ASD.')

Appendix 4. Assessment form used by Special Education experts for children with clinically diagnosed ASD

AQ-10 QUESTIONNAIRE FOR CHILDREN (AGES 4-11)
(Based on the questions in the ASD Tests application)

Child's full name:
Date of birth:
Survey date:
Intervention unit:

Questions A1 to A10 each take one of four answers:
☐ Definitely Agree  ☐ Slightly Agree  ☐ Slightly Disagree  ☐ Definitely Disagree

A1. She/he often notices small sounds when others do not.
A2. S/he usually concentrates more on the whole picture rather than the small details.
A3. In a social group, s/he can easily keep track of several different people's conversations.
A4. S/he finds it easy to go back and forth between different activities.
A5. S/he doesn't know how to keep a conversation going with his/her peers.
A6. S/he is good at social chit-chat.
A7. When s/he is read a story, s/he finds it difficult to work out the character's intentions or feelings.
A8. When s/he was in preschool, s/he used to enjoy playing pretending games with other children.
A9. S/he finds it easy to work out what someone is thinking or feeling just by looking at their face.
A10. S/he finds it hard to make new friends.

11. Age (4-11 years):
12. Gender: ☐ Male  ☐ Female
13. Was your child born with jaundice? ☐ Yes  ☐ No
14. Has anyone in the immediate family been diagnosed with autism? ☐ Yes  ☐ No
15. Who is completing this test? ☐ Parent  ☐ Relative  ☐ Health care professional  ☐ Other:
16. Child's status (as diagnosed by a clinical specialist): ☐ Has autism spectrum disorder  ☐ Does not have autism spectrum disorder  Other:

The answers provided are used solely for scientific research!

COMPLETED BY
(Signature and full name)
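
The screening function in Appendix 3 takes the ten AQ-10 answers as a vector of ten values, while the form above records one of four agreement levels per question. The sketch below shows one possible way to turn the form's answers into such a vector. It assumes the published AQ-10 child scoring key (one point for agreeing with items 1, 5, 7, and 10, and one point for disagreeing with items 2, 3, 4, 6, 8, and 9) and assumes the model was trained on features scored this way. `encode_aq10` and the answer labels are hypothetical names, not part of the thesis code.

```python
# Hypothetical helper: converts AQ-10 (child) answers into ten 0/1 features.
# Assumption: the published AQ-10 child scoring key applies to the model inputs.

AGREE = {"definitely agree", "slightly agree"}
DISAGREE = {"slightly disagree", "definitely disagree"}

# Items that score a point on agreement; all other items score on disagreement.
AGREE_SCORED_ITEMS = {1, 5, 7, 10}


def encode_aq10(answers):
    """Return a list of ten 0/1 scores for the answers to items A1..A10."""
    if len(answers) != 10:
        raise ValueError("expected exactly 10 answers")
    features = []
    for item, answer in enumerate(answers, start=1):
        normalized = answer.strip().lower()
        if normalized not in AGREE | DISAGREE:
            raise ValueError(f"unknown answer for A{item}: {answer!r}")
        scoring_set = AGREE if item in AGREE_SCORED_ITEMS else DISAGREE
        features.append(1 if normalized in scoring_set else 0)
    return features


example = ["Definitely Agree"] * 10
print(encode_aq10(example))       # [1, 0, 0, 0, 1, 0, 1, 0, 0, 1]
print(sum(encode_aq10(example)))  # 4
```

The resulting list could then be reshaped to one row of ten columns before being passed to a loaded model, as Appendix 3 does with the output of `NhapDacTrung()`.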

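Each function in Appendix 1 derives the same metrics from the four cells of the confusion matrix (TP, TN, FP, FN, as defined in the list of abbreviations). As a compact restatement of those formulas, the sketch below computes them in one place, together with the F-beta score for beta = 0.5 that the listing obtains from scikit-learn. `confusion_metrics` is a hypothetical name, and the counts in the example are made up for illustration; they are not results from the thesis.

```python
def confusion_metrics(tp, tn, fp, fn, beta=0.5):
    """Compute the Appendix 1 metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (both classes occur and at
    least one positive prediction was made).
    """
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    f_beta = ((1 + beta**2) * precision * sensitivity
              / (beta**2 * precision + sensitivity))
    return {
        "accuracy": (tp + tn) / total,
        "classification_error": (fp + fn) / total,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (tn + fp),
        "precision": precision,
        "f_beta": f_beta,
    }


# Illustrative counts only (not taken from the thesis).
m = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With beta below 1, the F-beta score weights precision more heavily than sensitivity, which is the choice made in the listing.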
Title	Evaluating a Dataset for Autism Spectrum Disorder Classification Using the SVM and Random Forest Algorithms
Author	Phạm Quang Thuận
Supervisor	Assoc. Prof. Dr. Nguyễn Đình Thuân
Institution	Nha Trang University
Major	Information Technology
Document type	Master's Thesis
Year of publication	2019
City	Khánh Hòa

Format
Pages	72
File size	1.38 MB