Using some machine learning algorithms to predict students' academic performance


DECLARATION

I hereby declare that this master's thesis in Computer Science, entitled "Using some machine learning algorithms to predict students' academic performance", is my own research, study, and presentation, carried out under the scientific supervision of Dr. Đàm Thanh Phương, University of Information and Communication Technology - Thai Nguyen University. The findings and research results in the thesis are entirely truthful and do not violate any provision of intellectual property law or Vietnamese law. If this is not the case, I take full responsibility before the law. All documents, papers, theses, and software tools of other authors reused in this thesis are explicitly attributed to their authors and are included in the list of references.

Thái Nguyên, 18 October 2020
Author of the thesis: Nguyễn Bích Quỳnh

ACKNOWLEDGEMENTS

The author sincerely thanks Dr. Đàm Thanh Phương, University of Information and Communication Technology - Thai Nguyen University, the scientific supervisor who guided the author in completing this thesis, and thanks the lecturers of the University of Information and Communication Technology, where the author studied and completed the master's program, for their enthusiastic teaching and support. Thanks also go to Lương Thế Vinh High School - Cẩm Phả - Quảng Ninh, where the author works, for creating every favourable condition for the author to collect data and complete the research and the study program. Finally, the author thanks family, friends, and colleagues for their encouragement and help throughout the time of studying, researching, and completing this thesis.

Thái Nguyên, 18 October 2020
Author of the thesis: Nguyễn Bích Quỳnh

LIST OF FIGURES

2.1 Information survey form
2.2 Information survey form (continued)
2.3 Some attributes (a)
2.4 Some attributes (b)
2.5 Some attributes (c)
2.6
2.7 Statistics of attributes with missing data
2.8 Clustering students by their subject grade averages
2.9 Feature selection with Lasso
3.1 Accuracy of the models using all features
3.2 Accuracy of the models using feature selection
3.3 Predicted scores of some students using all features
3.4 Predicted scores of some students using feature selection

LIST OF TABLES

3.1 Accuracy of the models trained on data with all attributes
3.2 Accuracy of the models trained on data with selected attributes

LIST OF SYMBOLS AND ABBREVIATIONS

R, Z, C, R^n, C^k, ||·||, SVM, LR, NB, KNN, TBCM, MLE, MAP, NBC, RF, AD, GD, IDE

TABLE OF CONTENTS

Declaration
Acknowledgements
List of figures
List of tables
List of symbols and abbreviations
Introduction
Chapter 1. OVERVIEW OF MACHINE LEARNING
  1.1 Machine learning algorithms
  1.2 Data
  1.3 Basic problems in machine learning
  1.4 Grouping of machine learning algorithms
  1.5 Loss functions and model parameters
Chapter 2. DATA COLLECTION AND PROCESSING
  2.1 Problem statement
  2.2 Data collection
  2.3 Feature engineering
Chapter 3. MODEL TRAINING AND EVALUATION OF RESULTS
  3.1 Some algorithms selected for model training
  3.2 Model training
  3.3 Selecting and optimizing model parameters
  3.4 Results and evaluation
General conclusions
References
APPENDIX
  3.1 Data processing - file process_data.py
  3.2 File main

INTRODUCTION

Today, as society keeps developing, the use of computers in people's work and daily life has generated a large and complex amount of data (big data), digitized and stored on computers. This big data may include structured, unstructured, and semi-structured data: online sales information, website traffic, personal information, people's daily activity habits, and so on. It contains much valuable information which, if exploited properly, becomes knowledge and an asset of great value. The challenge for people is to come up with suitable methods, algorithms, and tools to analyze such large amounts of data. It has been observed that computers can analyze and process data that is large and complex, finding patterns and rules beyond the capacity, computation speed, and memory of the human brain. The concept of machine learning arose from this. The basic idea of machine learning is that computers can learn automatically from experience [1]: a computer analyzes a large amount of data, finds patterns and rules hidden in the data, uses those rules to describe new data automatically, and continuously improves.

Machine learning has very many applications across many fields. Search engines use machine learning to build better relationships between search phrases and web pages: by analyzing the content of web pages, a search engine can determine which words are the most important phrases for identifying a particular page, and it can use those phrases to return relevant results for a given search query [2]. Image recognition technology also uses machine learning to identify specific objects, such as faces [5]: first, a machine learning algorithm analyzes images containing a particular object; once it has been given such images, the algorithm can determine whether a new image contains that object or not [3]. Machine learning can also be used to understand which kinds of products a customer is interested in, by analyzing the products the user has bought in the past; the computer can then recommend products the customer is likely to buy [1]. All of these examples share the same basic principle: the computer processes data and learns how to characterize it, then uses this knowledge to make decisions about future data.

Depending on the type of input data, machine learning algorithms can be divided into supervised and unsupervised learning. In supervised learning, the input data is labelled and comes with a known structure [1], [5]; this input data is called training data. The algorithm's task is usually to build a model that can predict some attributes from the attributes that are already known. Once the model has been built, it is used to process data with the same structure as the input data. In unsupervised learning, the input data has no labels and no known structure; the task of the algorithm is to discover a structure in the data [2].
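To make the supervised-learning workflow described above concrete, here is a minimal illustrative sketch in Python; it is not taken from the thesis, and the feature matrix and labels are toy values invented purely for illustration:

    # Minimal supervised-learning sketch: fit a model on labelled data, predict on new data.
    # The feature matrix X and labels y below are toy values, not the thesis data.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    X = np.random.rand(200, 5)                      # 200 samples described by 5 features
    y = (X[:, 0] + X[:, 1] > 1.0).astype(int)       # labels known in advance (supervised setting)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model = LogisticRegression().fit(X_train, y_train)   # learn rules from the labelled data
    print('Accuracy on unseen data:', model.score(X_test, y_test))
    print('Predicted labels for new samples:', model.predict(X_test[:5]))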
Encouraged and guided by the supervising lecturer, the author took the first steps in studying applications of machine learning in education, with one task in mind: predicting students' academic results from the data collected about them. This is a research direction that is attracting the attention of many scientists around the world [6], [7], [8]. In [7], the authors …

Logistic Regression: the important parameters of the algorithm include [4]:

C: float, default=1.0. The inverse regularization parameter - a control variable that adjusts the strength of regularization, being placed in inverse relation to the regularizer lambda.

solver: 'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga', default='lbfgs'. The algorithm used for the optimization problem. For small data sets, 'liblinear' is the best choice, while 'sag' and 'saga' are faster on large data sets. For multi-class problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' can be used.

The parameter set obtained for the Logistic Regression algorithm is [4] {'C': 10, 'penalty': 'l2', 'solver': 'newton-cg'}, and the accuracy achieved is 71%.

Random Forest: the important parameters of the Random Forest algorithm include [4]:

n_estimators: int, default=100. The number of trees in the forest.

max_depth: int, default=None. The maximum depth of a tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples of the dataset.

max_features: 'auto', 'sqrt', 'log2', int or float, default='auto'. The number of features to consider when looking for the best split: if int, that number of features is considered at each split; if float, max_features is a fraction and the number of features considered at each split is rounded from that fraction; if 'auto', then max_features = sqrt(n_features); if 'sqrt', then max_features = sqrt(n_features) (same as 'auto'); if 'log2', then max_features = log2(n_features); if None, then max_features = n_features.

The parameter set obtained for the RandomForestClassifier algorithm is [4] {'max_depth': 4, 'max_features': 'sqrt', 'n_estimators': 100}, and the accuracy achieved is 79%.
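As an illustration of how such tuned parameter sets are plugged back into the estimators, the following is a minimal sketch; the synthetic data from make_classification is only a stand-in for the thesis data, which is produced by the Chapter 2 preprocessing pipeline, and max_iter is added here only to avoid convergence warnings:

    # Applying the tuned parameter sets reported above (stand-in data, not the thesis data).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic 43-feature data as a placeholder for the preprocessed student records.
    X, y = make_classification(n_samples=300, n_features=43, n_informative=10,
                               n_classes=4, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

    log_reg = LogisticRegression(C=10, penalty='l2', solver='newton-cg', max_iter=1000)
    rf = RandomForestClassifier(max_depth=4, max_features='sqrt', n_estimators=100)

    log_reg.fit(X_train, y_train)
    rf.fit(X_train, y_train)
    print('LogisticRegression test accuracy:', log_reg.score(X_test, y_test))
    print('RandomForestClassifier test accuracy:', rf.score(X_test, y_test))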
Gradient Boosting: the important parameters of Gradient Boosting include [4]:

n_estimators: int, default=100. The number of boosting stages to perform. Gradient boosting is fairly robust against over-fitting, so a large number usually gives better performance.

learning_rate: float, default=0.1. The learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning_rate and n_estimators.

subsample: float, default=1.0. The fraction of samples used for fitting each individual base learner. If smaller than 1.0, this results in Stochastic Gradient Boosting. subsample interacts with the parameter n_estimators; choosing a subsample smaller than 1.0 leads to a reduction of variance and an increase in bias.

The parameter set obtained for the GradientBoostingClassifier algorithm is {'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 10, 'subsample': 1.0}, and the accuracy achieved is 77%.

    Best score: 0.841492 using LogisticRegression {'C': 1.0, 'penalty': 'l2', 'solver': 'newton-cg'}
    Score LogisticRegression Data Test = 0.8641304347826086
    Best score: 0.838270 using KNeighborsClassifier {'metric': 'euclidean', 'n_neighbors': 19, 'weights': 'uniform'}
    Score KNeighborsClassifier Data Test = 0.8804347826086957
    Best score: 0.838270 using SVM {'C': 50, 'gamma': 'auto', 'kernel': 'poly'}
    Score SVM Data Test = 0.8804347826086957
    Best score: 0.839892 using DecisionTreeClassifier {'criterion': 'gini', 'max_depth': 2, 'max_features': 'log2', 'min_samples_leaf': 10, 'min_samples_split': …}
    Score DecisionTreeClassifier Test = 0.8804347826086957
    Best score: 0.840432 using RandomForestClassifier {'max_depth': 4, 'max_features': 'sqrt', 'n_estimators': 10}
    Score RandomForestClassifier Data Test = 0.8804347826086957

3.4 Results and evaluation

For convenience, the prediction step was built into a web demo. After choosing an algorithm, the user selects the test set or the data to be predicted and uploads it, then chooses whether to use all features or only the most influential selected features. When it runs, the system outputs the predicted subject grade average for each record and the corresponding ranking. Figures 3.3 and 3.4 show the prediction results for the students Trần Thế Anh, Vũ Tuấn Anh and Nguyễn Minh Lâm, using all features and using feature selection respectively. The results for the whole test set are also displayed ten rows at a time; only a few students are shown here because the screen cannot display them all, and the scroll bars can be used when needed. It is clear that when feature selection is used, the results are more accurate.

Figure 3.3: Predicted scores of some students using all features
Figure 3.4: Predicted scores of some students using feature selection

From a data set collected objectively, after a chain of steps - feature design, information collection, data normalization and model building - it is easy to identify the features that particularly affect students' academic performance, and finally to build a model that predicts students' learning capability. From there, the results can be applied and developed further to assess students' abilities, find the difficult problems they are struggling with, and propose suitable learning methods. The accuracy achieved by the model, 80%, is not an ideal figure, but it is partly acceptable and the model can be used.
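The prediction step of the web demo can be pictured as reloading one of the models saved by explode_model in the appendix and applying it to the uploaded records. The sketch below follows that idea; the 'Model/rf.sav' path matches the appendix's save paths, but the input CSV file, its columns, and the mapping from class index to ranking label are illustrative assumptions, since the actual label encoding is fixed during preprocessing:

    # Sketch of the prediction step: reload a saved model and rank uploaded student records.
    import pickle
    import pandas as pd

    # Load a model saved by explode_model() (the 'Model/' path comes from the appendix listing).
    model = pickle.load(open('Model/rf.sav', 'rb'))

    # Records to predict: assumed to have the same feature columns, in the same order,
    # as the training data from the Chapter 2 preprocessing (file name is hypothetical).
    new_students = pd.read_csv('students_to_predict.csv')
    pred = model.predict(new_students)

    # Hypothetical mapping from predicted class to ranking label; the real mapping is
    # whatever label encoding was chosen during preprocessing.
    ranking = {0: 'Trung binh', 1: 'Kha', 2: 'Gioi', 3: 'Xuat sac'}
    for i, c in enumerate(pred):
        print('Student', i, '->', ranking.get(c, c))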
Besides the results obtained, this thesis still has quite a few remaining limitations that need further work. The amount of data is still rather small and does not cover the whole problem to be solved, which leads to weak predictive capability. There is still a lot of Null data, because the collection and handling of this kind of data have not been good; such data should therefore be limited, especially for sensitive features that cannot be computed. More data should be collected and other data-processing methods explored, with the final goal of making the data rich and reflecting reality correctly and objectively.

GENERAL CONCLUSIONS

Under the dedicated guidance of the supervisor, and based on the approved thesis outline, the thesis has accomplished the following tasks:

(1) Studied the model for solving a practical problem with machine learning and applied it to approach a concrete problem; studied the machine learning algorithms in some detail, from the mathematical analysis to finding optimal solutions, and implemented them fluently in the Python environment and dedicated IDEs.

(2) Understood and carried out the workflow for working with the features of the data: cleaning, normalization, filling in missing values, and selecting the attributes that most affect the data.

(3) Understood and practised the process of training, optimizing, and selecting parameters for a machine learning model.

(4) Produced predictions and assessments for both labelled and unlabelled students.

On the basis of the results achieved, if the research can be continued, the thesis is a good foundation for further study of the following topics: machine learning methods applied to prediction and classification; optimization of machine learning model parameters; building practical applications that are feasible and highly effective.

REFERENCES

[1] Vũ Hữu Tiệp, Machine learning cơ bản. Ebook at machinelearningcoban.com, 2020.
[2] Hoàng Xuân Huấn, Giáo trình học máy. VNU Publishing House, Hanoi, 2015.
[3] Aurélien Géron, Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. O'Reilly Media, 2019.
[4] Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar, Foundations of Machine Learning. Massachusetts Institute of Technology, 2018.
[5] Peter Flach, Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge, 2012.
[6] Murat Pojon, Using Machine Learning to Predict Student Performance. M.Sc. Thesis, University of Tampere, 2017.
[7] Elaf Abu Amrieh, Thair Hamtini and Ibrahim Aljarah, Mining Educational Data to Predict Student's Academic Performance using Ensemble Methods. International Journal of Database Theory and Application, Vol. 9, No. 8 (2016), pp. 119-136.
[8] Kuca Danijel, Juricic Vedran and Dambic Goran, Machine Learning in Education - a Survey of Current Research Trends. Proceedings of the 29th DAAAM International Symposium, pp. 0406-0410, B. Katalinic (Ed.), DAAAM International, ISBN 978-3-902734-20-4, ISSN 1726-9679, Vienna, Austria, 2018.

APPENDIX - PROGRAM CODE

3.1 Data processing - file process_data.py

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sb
    import random
    import pickle
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble.bagging import BaggingClassifier
    from sklearn.ensemble.gradient_boosting import GradientBoostingClassifier
    from sklearn.ensemble.weight_boosting import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn import model_selection
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.model_selection import GridSearchCV
    from keras.models import Sequential
    from keras.layers import Dense

    def grid_serarch(grid, model, X_train, y_train):
        cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=451)
        grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv,
                                   scoring='accuracy', error_score=0)
        return grid_search.fit(X_train, y_train)
    def tuning_logistic(X_train, y_train):
        model = LogisticRegression()
        solvers = ['newton-cg', 'lbfgs', 'liblinear']
        penalty = ['l1', 'l2']
        c_values = [500, 100, 10, 1.0, 0.1, 0.01, 0.001]
        # define grid search
        grid = dict(solver=solvers, penalty=penalty, C=c_values)
        grid_result = grid_serarch(grid, model, X_train, y_train)
        # summarize results
        print("Best score: %f using LogisticRegression %s" % (grid_result.best_score_, grid_result.best_params_))
        return grid_result.best_params_

    def tuning_knn(X_train, y_train):
        model = KNeighborsClassifier()
        n_neighbors = range(1, 100, 2)
        weights = ['uniform', 'distance']
        metric = ['euclidean', 'manhattan', 'minkowski']
        # define grid search
        grid = dict(n_neighbors=n_neighbors, weights=weights, metric=metric)
        grid_result = grid_serarch(grid, model, X_train, y_train)
        # summarize results
        print("Best score: %f using KNeighborsClassifier %s" % (grid_result.best_score_, grid_result.best_params_))
        return grid_result.best_params_

    def tuning_svm(X_train, y_train):
        model = SVC()
        kernel = ['poly', 'rbf', 'sigmoid']
        C = [50, 10, 1.0, 0.1, 0.01, 0.01, 0.05, 0.001]
        gamma = ['scale', 'auto']
        grid = dict(kernel=kernel, C=C, gamma=gamma)
        grid_result = grid_serarch(grid, model, X_train, y_train)
        # summarize results
        print("Best score: %f using SVM %s" % (grid_result.best_score_, grid_result.best_params_))
        return grid_result.best_params_

    def tuning_dt(X_train, y_train):
        model = DecisionTreeClassifier()
        grid = {
            'max_features': ['log2', 'sqrt', 'auto'],
            'criterion': ['entropy', 'gini'],
            'max_depth': [2, 3, 5, 10, 50],
            'min_samples_split': [2, 3, 50, 100],
            'min_samples_leaf': [1, 5, 8, 10]
        }
        # define grid search
        grid_result = grid_serarch(grid, model, X_train, y_train)
        # summarize results
        print("Best score: %f using DecisionTreeClassifier %s" % (grid_result.best_score_, grid_result.best_params_))
        return grid_result.best_params_

    def tuning_rf(X_train, y_train):
        model = RandomForestClassifier()
        n_estimators = [10, 30, 50, 70, 100]
        max_depth = [2, 3, 4]
        max_features = ['sqrt', 'log2']
        # define grid search
        grid = dict(n_estimators=n_estimators, max_features=max_features, max_depth=max_depth)
        grid_result = grid_serarch(grid, model, X_train, y_train)
        # summarize results
        print("Best score: %f using RandomForestClassifier %s" % (grid_result.best_score_, grid_result.best_params_))
        return grid_result.best_params_

    def tuning_gd(X_train, y_train):
        model = GradientBoostingClassifier()
        n_estimators = [5, 10, 20]
        learning_rate = [0.001, 0.01, 0.1]
        subsample = [0.5, 0.7, 1.0]
        max_depth = [2, 3, 4]
        # define grid search
        grid = dict(learning_rate=learning_rate, n_estimators=n_estimators,
                    subsample=subsample, max_depth=max_depth)
        grid_result = grid_serarch(grid, model, X_train, y_train)
        # summarize results
        print("Best: %f using GradientBoostingClassifier %s" % (grid_result.best_score_, grid_result.best_params_))
        return grid_result.best_params_
    def compare_model(X, y):
        models = []
        models.append(('LR', LogisticRegression()))
        models.append(('KNN', KNeighborsClassifier()))
        models.append(('Tree', DecisionTreeClassifier()))
        models.append(('NB', GaussianNB()))
        models.append(('SVM', SVC()))
        models.append(('RF', RandomForestClassifier()))
        models.append(('AD', AdaBoostClassifier()))
        models.append(('GD', GradientBoostingClassifier()))
        models.append(('BG', BaggingClassifier()))
        results = []
        names = []
        scoring = 'accuracy'
        for name, model in models:
            kfold = model_selection.KFold(n_splits=5, random_state=42)
            cv_results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
            results.append(cv_results)
            names.append(name)
            msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
            print(msg)
        # boxplot algorithm comparison
        fig = plt.figure(figsize=(20, 10))
        fig.suptitle('Algorithm Comparison')
        ax = fig.add_subplot(111)
        plt.boxplot(results)
        ax.set_xticklabels(names)
        plt.show()

    def explode_model(X, y, columns='all'):
        path_model = 'Model/'
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

        param = tuning_logistic(X_train, y_train)
        model = LogisticRegression(C=param['C'], penalty=param['penalty'],
                                   solver=param['solver']).fit(X_train, y_train)
        pickle.dump(model, open(path_model + 'logistic.sav' if columns == 'all'
                                else path_model + 'logistic_selection.sav', 'wb'))
        print('Score LogisticRegression Data Test=', model.score(X_test, y_test))
        print('=' * 80)

        param = tuning_knn(X_train, y_train)
        model = KNeighborsClassifier(n_neighbors=param['n_neighbors'], weights=param['weights'],
                                     metric=param['metric']).fit(X_train, y_train)
        pickle.dump(model, open(path_model + 'knn.sav' if columns == 'all'
                                else path_model + 'knn_selection.sav', 'wb'))
        print('Score KNeighborsClassifier Data Test=', model.score(X_test, y_test))
        print('=' * 80)

        param = tuning_svm(X_train, y_train)
        model = SVC(C=param['C'], kernel=param['kernel'], gamma=param['gamma']).fit(X_train, y_train)
        pickle.dump(model, open(path_model + 'svm.sav' if columns == 'all'
                                else path_model + 'svm_selection.sav', 'wb'))
        print('Score SVM Data Test=', model.score(X_test, y_test))
        print('=' * 80)

        param = tuning_dt(X_train, y_train)
        model = DecisionTreeClassifier(max_features=param['max_features'], criterion=param['criterion'],
                                       max_depth=param['max_depth'],
                                       min_samples_split=param['min_samples_split'],
                                       min_samples_leaf=param['min_samples_leaf']).fit(X_train, y_train)
        print('Score DecisionTreeClassifier Test=', model.score(X_test, y_test))
        print('=' * 80)

        param = tuning_rf(X_train, y_train)
        model = RandomForestClassifier(n_estimators=param['n_estimators'], max_depth=param['max_depth'],
                                       max_features=param['max_features']).fit(X_train, y_train)
        pickle.dump(model, open(path_model + 'rf.sav' if columns == 'all'
                                else path_model + 'rf_selection.sav', 'wb'))
        print('Score RandomForestClassifier Data Test=', model.score(X_test, y_test))
        print('=' * 80)

        param = tuning_gd(X_train, y_train)
        model = GradientBoostingClassifier(learning_rate=param['learning_rate'], max_depth=param['max_depth'],
                                           n_estimators=param['n_estimators'],
                                           subsample=param['subsample']).fit(X_train, y_train)
        pickle.dump(model, open(path_model + 'xgb.sav' if columns == 'all'
                                else path_model + 'xgb_selection.sav', 'wb'))
        print('Score GradientBoostingClassifier Data Test=', model.score(X_test, y_test))
    def create_model_mlp(X, y, ip=43, epochs=20):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=451)
        # define the keras model
        model = Sequential()
        model.add(Dense(128, input_dim=ip, activation='relu'))
        model.add(Dense(64, activation='relu'))
        model.add(Dense(16, activation='relu'))
        model.add(Dense(4, activation='softmax'))
        # compile the keras model
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
        model.fit(X_train, y_train, epochs=25, batch_size=4)
        score, acc = model.evaluate(X_test, y_test, batch_size=4)
        print('Test score:', score)
        print('Test accuracy:', acc)
        model_json = model.to_json()
        path = 'Model/model_mlp' if ip > 30 else 'Model/model_mlp_select'
        with open(path + ".json", "w") as json_file:
            json_file.write(model_json)
        # serialize weights to HDF5
        model.save_weights(path + ".h5")
        print("Saved model to disk")

3.2 File main

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sb
    import random
    import pickle
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble.bagging import BaggingClassifier
    from sklearn.ensemble.gradient_boosting import GradientBoostingClassifier
    from sklearn.ensemble.weight_boosting import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn import model_selection
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.model_selection import GridSearchCV
    from keras.models import Sequential
    from keras.layers import Dense

    def grid_serarch(grid, model, X_train, y_train):
        cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=451)
        grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv,
                                   scoring='accuracy', error_score=0)
        return grid_search.fit(X_train, y_train)

    def tuning_logistic(X_train, y_train):
        model = LogisticRegression()
        solvers = ['newton-cg', 'lbfgs', 'liblinear']
        penalty = ['l1', 'l2']
        c_values = [500, 100, 10, 1.0, 0.1, 0.01, 0.001]
        # define grid search
        grid = dict(solver=solvers, penalty=penalty, C=c_values)
        grid_result = grid_serarch(grid, model, X_train, y_train)
        # summarize results
        print("Best score: %f using LogisticRegression %s" % (grid_result.best_score_, grid_result.best_params_))
        return grid_result.best_params_

    def tuning_knn(X_train, y_train):
        model = KNeighborsClassifier()
        n_neighbors = range(1, 100, 2)
        weights = ['uniform', 'distance']
        metric = ['euclidean', 'manhattan', 'minkowski']
        # define grid search
        grid = dict(n_neighbors=n_neighbors, weights=weights, metric=metric)
        grid_result = grid_serarch(grid, model, X_train, y_train)
        # summarize results
        print("Best score: %f using KNeighborsClassifier %s" % (grid_result.best_score_, grid_result.best_params_))
        return grid_result.best_params_

    def tuning_svm(X_train, y_train):
        model = SVC()
        kernel = ['poly', 'rbf', 'sigmoid']
        C = [50, 10, 1.0, 0.1, 0.01, 0.01, 0.05, 0.001]
        gamma = ['scale', 'auto']
        grid = dict(kernel=kernel, C=C, gamma=gamma)
        grid_result = grid_serarch(grid, model, X_train, y_train)
        # summarize results
        print("Best score: %f using SVM %s" % (grid_result.best_score_, grid_result.best_params_))
        return grid_result.best_params_
DecisionTreeClassifier %s" % (grid_result.best_score_, grid_result.best_params_)) return grid_result.best_params_ def tuning_rf(X_train , y_train): model = RandomForestClassifier() n_estimators = [10, 30, 50, 70, 100] max_depth = [2, 3, 4] max_features = [’sqrt’, ’log2’] # define grid search grid = dict(n_estimators=n_estimators,max_features=max_features, max_depth=max_depth) grid_result = grid_serarch(grid, model, X_train ,y_train) # summarize results print("Best score: %f using RandomForestClassifier %s" % (grid_result.best_score_, grid_result.best_params_)) return grid_result.best_params_ def tuning_gd(X_train , y_train): model = GradientBoostingClassifier() n_estimators = [5, 10, 20] learning_rate = [0.001, 0.01, 0.1] 52 subsample = [0.5, 0.7, 1.0] max_depth = [2, 3, 4] # define grid search grid = dict(learning_rate=learning_rate, n_estimators=n_estimators, subsample=subsample, max_depth=max_depth) grid_result = grid_serarch(grid, model, X_train ,y_train) # summarize results print("Best: %f using GradientBoostingClassifier %s" % (grid_result.best_score_, grid_result.best_params_)) return grid_result.best_params_ def compare_model(X, y): models = [] models.append((’LR’, LogisticRegression())) models.append((’KNN’, KNeighborsClassifier())) models.append((’Tree’, DecisionTreeClassifier())) models.append((’NB’, GaussianNB())) models.append((’SVM’, SVC())) models.append((’RF’, RandomForestClassifier())) models.append((’AD’, AdaBoostClassifier())) models.append((’GD’, GradientBoostingClassifier())) models.append((’BG’, BaggingClassifier())) results = [] names = [] scoring = ’accuracy’ for name, model in models: kfold = model_selection.KFold(n_splits=5, random_state=42) cv_results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring) results.append(cv_results) names.append(name) msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()) print(msg) # boxplot algorithm comparison fig = plt.figure(figsize=(20,10)) fig.suptitle(’Algorithm Comparison’) ax = fig.add_subplot(111) plt.boxplot(results) ax.set_xticklabels(names) plt.show() def explode_model(X, y, columns = ’all’): path_model = ’Model/’ X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) param = tuning_logistic(X_train, y_train) model = LogisticRegression(C = param[’C’], penalty = param[’penalty’], solver = param[’solver’]).fit(X_train, y_train) pickle.dump(model, open(path_model + ’logistic.sav’ if columns == ’all’ else path_model + ’logistic_selection.sav’, ’wb’)) print(’Score LogisticRegression Data Test=’, model.score(X_test, y_test)) print(’===================================================================================================== param = tuning_knn(X_train, y_train) model = KNeighborsClassifier(n_neighbors = param[’n_neighbors’], weights = param[’weights’], metric = param[’metric’]).fit(X_train, y_train) pickle.dump(model, open(path_model + ’knn.sav’ if columns == ’all’ else path_model + ’knn_selection.sav’, ’wb’)) print(’Score KNeighborsClassifier Data Test=’, model.score(X_test, y_test)) 53 print(’===================================================================================================== param = tuning_svm(X_train, y_train) model = SVC(C = param[’C’], kernel = param[’kernel’], gamma = param[’gamma’]).fit(X_train, y_train) pickle.dump(model, open(path_model + ’svm.sav’ if columns == ’all’ else path_model + ’svm_selection.sav’, ’wb’)) print(’Score SVM Data Test=’, model.score(X_test, y_test)) 
    def explode_model(X, y, columns='all'):
        path_model = 'Model/'
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

        param = tuning_logistic(X_train, y_train)
        model = LogisticRegression(C=param['C'], penalty=param['penalty'],
                                   solver=param['solver']).fit(X_train, y_train)
        pickle.dump(model, open(path_model + 'logistic.sav' if columns == 'all'
                                else path_model + 'logistic_selection.sav', 'wb'))
        print('Score LogisticRegression Data Test=', model.score(X_test, y_test))
        print('=' * 80)

        param = tuning_knn(X_train, y_train)
        model = KNeighborsClassifier(n_neighbors=param['n_neighbors'], weights=param['weights'],
                                     metric=param['metric']).fit(X_train, y_train)
        pickle.dump(model, open(path_model + 'knn.sav' if columns == 'all'
                                else path_model + 'knn_selection.sav', 'wb'))
        print('Score KNeighborsClassifier Data Test=', model.score(X_test, y_test))
        print('=' * 80)

        param = tuning_svm(X_train, y_train)
        model = SVC(C=param['C'], kernel=param['kernel'], gamma=param['gamma']).fit(X_train, y_train)
        pickle.dump(model, open(path_model + 'svm.sav' if columns == 'all'
                                else path_model + 'svm_selection.sav', 'wb'))
        print('Score SVM Data Test=', model.score(X_test, y_test))
        print('=' * 80)

        param = tuning_dt(X_train, y_train)
        model = DecisionTreeClassifier(max_features=param['max_features'], criterion=param['criterion'],
                                       max_depth=param['max_depth'],
                                       min_samples_split=param['min_samples_split'],
                                       min_samples_leaf=param['min_samples_leaf']).fit(X_train, y_train)
        print('Score DecisionTreeClassifier Test=', model.score(X_test, y_test))
        print('=' * 80)

        param = tuning_rf(X_train, y_train)
        model = RandomForestClassifier(n_estimators=param['n_estimators'], max_depth=param['max_depth'],
                                       max_features=param['max_features']).fit(X_train, y_train)
        pickle.dump(model, open(path_model + 'rf.sav' if columns == 'all'
                                else path_model + 'rf_selection.sav', 'wb'))
        print('Score RandomForestClassifier Data Test=', model.score(X_test, y_test))
        print('=' * 80)

        param = tuning_gd(X_train, y_train)
        model = GradientBoostingClassifier(learning_rate=param['learning_rate'], max_depth=param['max_depth'],
                                           n_estimators=param['n_estimators'],
                                           subsample=param['subsample']).fit(X_train, y_train)
        pickle.dump(model, open(path_model + 'xgb.sav' if columns == 'all'
                                else path_model + 'xgb_selection.sav', 'wb'))
        print('Score GradientBoostingClassifier Data Test=', model.score(X_test, y_test))

    def create_model_mlp(X, y, ip=43, epochs=20):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=451)
        # define the keras model
        model = Sequential()
        model.add(Dense(128, input_dim=ip, activation='relu'))
        model.add(Dense(64, activation='relu'))
        model.add(Dense(16, activation='relu'))
        model.add(Dense(4, activation='softmax'))
        # compile the keras model
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
        model.fit(X_train, y_train, epochs=25, batch_size=4)
        score, acc = model.evaluate(X_test, y_test, batch_size=4)
        print('Test score:', score)
        print('Test accuracy:', acc)
        model_json = model.to_json()
        path = 'Model/model_mlp' if ip > 30 else 'Model/model_mlp_select'
        with open(path + ".json", "w") as json_file:
            json_file.write(model_json)
        # serialize weights to HDF5
        model.save_weights(path + ".h5")
        print("Saved model to disk")
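The appendix listings only define functions and never call them. A minimal driver sketch could look as follows, assuming the first listing is saved as process_data.py, that the preprocessed data set is available as a CSV file, and that the label column is named 'xeploai'; the file name, the column name, and the use of integer-encoded class labels are all illustrative assumptions:

    # Hypothetical driver for the appendix functions (names and paths are assumptions).
    import pandas as pd
    from keras.utils import to_categorical
    from process_data import compare_model, explode_model, create_model_mlp

    # Preprocessed student records produced by the Chapter 2 pipeline (file name assumed).
    data = pd.read_csv('students_processed.csv')
    X = data.drop(columns=['xeploai']).values   # 'xeploai' (ranking) assumed as the label column
    y = data['xeploai'].values                  # assumed to be integer-encoded class labels

    compare_model(X, y)           # cross-validated comparison of the baseline models
    explode_model(X, y, 'all')    # tune, refit, and save each model using all features

    # The MLP is compiled with categorical_crossentropy, so it expects one-hot targets.
    create_model_mlp(X, to_categorical(y), ip=X.shape[1])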
