Applying random forest and neural network model to predict customers’ behaviors

VIETNAM NATIONAL UNIVERSITY, HANOI
INTERNATIONAL SCHOOL

GRADUATION PROJECT

PROJECT NAME
APPLYING RANDOM FOREST AND NEURAL NETWORK MODEL TO PREDICT CUSTOMERS’ BEHAVIORS

SUPERVISOR: Dr Tran Duc Quynh
STUDENT: Nguyen Huong Ly
STUDENT ID: 16071293
COHORT: MAJOR: MIS2016A

Hanoi - Year 2020

LETTER OF DECLARATION

I hereby declare that the Graduation Project APPLYING RANDOM FOREST AND NEURAL NETWORK MODEL TO PREDICT CUSTOMERS’ BEHAVIORS is the result of my own research and has never been published in any work of others. During the implementation of this project, I have seriously observed research ethics; all findings of this project are results of my own research and surveys; all references in this project are clearly cited according to regulations. I take full responsibility for the fidelity of the numbers, data, and other contents of my graduation project.

Hanoi, 4th June 2020
Student
Nguyen Huong Ly

ACKNOWLEDGEMENT

Firstly, I would like to express my sincere appreciation toward Dr Tran Duc Quynh. I am proud and honored to have been guided and helped to finish my graduation thesis under his supervision. Secondly, I would also like to express my gratitude to the teachers and professors who taught me the knowledge and skills needed to finish my graduation thesis. Last but not least, sincere thanks to my family and friends, who always stay by my side and encourage me to overcome challenges during the process of writing this graduation thesis.

Hanoi, 4th June 2020
Student
Nguyen Huong Ly

TABLE OF CONTENTS

CHAPTER I INTRODUCTION TO MACHINE LEARNING
  1.1 Definition
  1.2 Application
  1.3 Classification
  1.4 Advantages and disadvantages
  1.5 Accuracy in machine learning
CHAPTER II THEORETICAL BACKGROUND
  2.1 Decision tree
    2.1.1 Definition
    2.1.2 Decision tree graph description
    2.1.3 Induction
    2.1.4 Advantages and disadvantages
  2.2 Random forest
    2.2.1 Definition
    2.2.2 Why random forest is better than decision tree
    2.2.3 How random forest works
    2.2.4 Advantages and disadvantages
    2.2.5 Application
  2.3 Multilayer perceptron
    2.3.1 Definition
    2.3.2 How multilayer perceptron works
    2.3.3 Advantage and disadvantage
    2.3.4 Application
CHAPTER III CASE STUDY
  3.1 Problem
    3.1.1 Problem statement
    3.1.2 Data description
  3.2 Tools introduction
    3.2.1 Python
    3.2.2 Packages
  3.3 Problem solving
    3.3.1 Data insight
    3.3.2 Data preprocessing
    3.3.3 Model and result
    3.3.4 Conclusion
REFERENCES
APPENDIX

TABLE OF NOTATIONS AND ABBREVIATIONS

Abbreviation  Meaning
AI            artificial intelligence
e.g.          for example
cm            centimeter
MLP           multilayer perceptron

LIST OF TABLES

Table 1. Confusion matrix
Table 2. Random forest results
Table 3. MLP results
Table 4. Oversampling Random Forest
Table 5. Oversampling MLP
Table 6. Under-sampling Random Forest
Table 7. Under-sampling MLP

LIST OF CHARTS AND FIGURES

Figure 1. Decision tree graph description
Figure 2. Nominal test condition
Figure 3. Continuous test condition
Figure 4. Continuous test condition
Figure 5. Best split
Figure 6. Identify best split
Figure 7. Stop splitting
Figure 8. Stop splitting
Figure 9. Multilayer perceptron
Figure 10. General description
Figure 11. Number of responses "Yes" and "No"
Figure 12. Relationship of renew offer and response

ABSTRACT

The overall purpose of this thesis is to apply machine learning, specifically random forest and multilayer perceptron, to solve a realistic problem. The thesis consists of three parts: “Introduction to machine learning”, which briefly introduces the concept of machine learning and its applications; “Theoretical background”, which presents the concepts of the classifiers used in solving the problem; and “Case study”, which applies these theories to a real-life problem. After
solving the problem, some important conclusions can be drawn, such as interesting insights into the dataset and how to build the best possible prediction model.

Figure 11. Number of responses "Yes" and "No"

The relationship between responses and renew offer type is shown below. An interesting insight from this chart is that no customer responded ‘Yes’ to one of the offers, while nearly 700 customers said ‘Yes’ to another. From this information, it is likely that the latter offer has a better chance of being purchased by customers, while the former needs some changes or improvements before it can be sold.

Figure 12. Relationship of renew offer and response

3.3.2 Data preprocessing

The dataset contains no missing or duplicated values. In the preprocessing step, values in the “Response” column are converted into binary: “No” equals 0 and “Yes” equals 1. Two columns, “Customer” and “Effective To Date”, are dropped as they serve no purpose in the final model. To support the model-building process afterward, categorical columns are converted into numerical ones using get_dummies and LabelEncoder. Columns with ordinal values are treated with LabelEncoder, while the others are treated with get_dummies:

- LabelEncoder:
  - Coverage
  - Education
  - Location Code
  - Vehicle Size
- get_dummies:
  - State
  - EmploymentStatus
  - Gender
  - Marital Status
  - Policy Type
  - Policy
  - Renew Offer Type
  - Sales Channel
  - Vehicle Class

3.3.3 Model and results

The dataset is now divided into training and test sets with a ratio of 80% to 20%. After the split, random forest and MLP are applied to the training data to build prediction models, while the test data are used to measure model performance through accuracy and F1 score. Results are calculated using cross-validation. Because the dataset is imbalanced, the prediction models are applied in three cases: the first applies them to the whole dataset, while the second and third use oversampling and under-sampling, respectively, to obtain equal numbers of “Yes” and “No” responses before applying the prediction models. The result for each case is shown below:

a) Applying on whole dataset

- Random forest

Max_features  n_estimators  criterion  F1              Acc
?             20            gini       0.942 (±0.01)   0.98 (±0.01)
?             20            entropy    0.944 (±0.02)   0.98 (±0.0)
10            20            gini       0.956 (±0.0)    0.99 (±0.00)
10            20            entropy    0.956 (±0.01)   0.99 (±0.01)
15            20            gini       0.981 (±0.0)    0.99 (±0.00)
15            20            entropy    0.983 (±0.0)    0.99 (±0.00)
15            100           gini       0.981 (±0.01)   1.00 (±0.00)
15            100           entropy    0.983 (±0.0)    1.00 (±0.00)

Table 2. Random forest results

- MLP

Hiddenlayer  activation  F1              Acc
5,10         relu        0.013 (±0.06)   0.86 (±0.02)
5,10         ?           0.00 (±0.05)    0.87 (±0.00)
5,10         logistic    0.00 (±0.04)    0.87 (±0.00)
5,10         identity    0.015 (±0.07)   0.69 (±0.05)
2,5          relu        0.00 (±0.05)    0.87 (±0.00)
2,5          ?           0.00 (±0.0)     0.87 (±0.00)
2,5          logistic    0.00 (±0.03)    0.87 (±0.00)
2,5          identity    0.079 (±0.09)   0.77 (±0.03)
5,5          relu        0.228 (±0.01)   0.53 (±0.06)
5,5          ?           0.253 (±0.2)    0.55 (±0.03)
5,5          logistic    0.252 (±0.04)   0.55 (±0.014)
5,5          identity    0.24 (±0.05)    0.5 (±0.06)

Table 3. MLP results

b) Using oversampling

- Random forest

Max_features  n_estimators  criterion  F1               Acc
?             20            gini       0.975 (±0.006)   0.992 (±0.001)
?             20            entropy    0.976 (±0.004)   0.993 (±0.001)
10            20            gini       0.983 (±0.004)   0.995 (±0.001)
10            20            entropy    0.982 (±0.006)   0.99 (±0.01)
15            20            gini       0.983 (±0.007)   0.995 (±0.001)
15            20            entropy    0.984 (±0.005)   0.996 (±0.001)
15            100           gini       0.985 (±0.004)   0.996 (±0.001)
15            100           entropy    0.985 (±0.006)   0.996 (±0.001)

Table 4. Oversampling Random Forest

- MLP

Hiddenlayer  activation  F1               Acc
5,10         relu        0.268 (±0.01)    0.504 (±0.01)
5,10         ?           0.207 (±0.01)    0.389 (±0.06)
5,10         logistic    0.206 (±0.02)    0.363 (±0.03)
5,10         identity    0.194 (±0.03)    0.374 (±0.04)
2,5          relu        0.285 (±0.02)    0.266 (±0.11)
2,5          ?           0.236 (±0.02)    0.356 (±0.122)
2,5          logistic    0.237 (±0.01)    0.353 (±0.3)
2,5          identity    0.163 (±0.11)    0.459 (±0.03)
5,5          relu        0.205 (±0.2)     0.67 (±0.2)
5,5          ?           0.246 (±0.01)    0.343 (±0.01)
5,5          logistic    0.246 (±0.016)   0.345 (±0.012)
5,5          identity    0.238 (±0.019)   0.422 (±0.09)

Table 5. Oversampling MLP

c) Using under-sampling

- Random forest

Max_features  n_estimators  criterion  F1               Acc
?             20            gini       0.709 (±0.025)   0.91 (±0.005)
?             20            entropy    0.706 (±0.02)    0.89 (±0.01)
10            20            gini       0.741 (±0.02)    0.91 (±0.008)
10            20            entropy    0.742 (±0.027)   0.912 (±0.009)
15            20            gini       0.745 (±0.03)    0.91 (±0.007)
15            20            entropy    0.74 (±0.03)     0.91 (±0.01)
15            100           gini       0.75 (±0.03)     0.914 (±0.014)
15            100           entropy    0.762 (±0.02)    0.92 (±0.01)

Table 6. Under-sampling Random Forest

- MLP

Hiddenlayer  activation  F1               Acc
5,10         relu        0.24 (±0.33)     0.3 (±0.18)
5,10         ?           0.24 (±0.015)    0.333 (±0.1)
5,10         logistic    0.24 (±0.02)     0.355 (±0.01)
5,10         identity    0.579 (±0.2)     0.155 (±0.09)
2,5          relu        0.23 (±0.02)     0.45 (±0.27)
2,5          ?           0.245 (±0.02)    0.347 (±0.02)
2,5          logistic    0.23 (±0.02)     0.26 (±0.1)
2,5          identity    0.2 (±0.05)      0.451 (±0.1)
5,5          relu        0.21 (±0.04)     0.44 (±0.2)
5,5          ?           0.24 (±0.01)     0.3 (±0.08)
5,5          logistic    0.23 (±0.02)     0.31 (±0.1)
5,5          identity    0.167 (±0.09)    0.519 (±0.2)

Table 7. Under-sampling MLP

From the tables above, it is clear that random forest is a better model than MLP for this dataset in all three cases. For this dataset in particular, using random forest with oversampling, max_features = 15, n_estimators = 20 and criterion = entropy gives F1 = 0.984 and accuracy = 0.996. This is within 0.001 of the highest F1 score in the tables (0.985, obtained with n_estimators = 100) while using far fewer trees.

3.3.4 Conclusion

For this dataset in particular, random forest yields a much better model than MLP, and oversampling should be used to balance the dataset. The final prediction model is built using the random forest classifier, with accuracy and F1 score of 0.996 and 0.984 respectively. Among the offers provided to customers, two yield the most ‘Yes’ responses while one yields none; it can therefore be said that customers are hardly interested in the latter. The company should improve or eliminate that offer, as in its current form it serves no purpose for the company. Due to time limitations, only two classifiers were used to build the prediction model. In the near future, other classifiers can be applied to build a better prediction model for this case study.

APPENDIX

- Visualize data
- Preprocessing data
- Split training and testing data
- Applying model on whole dataset
- Using oversampling
- Using under-sampling
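The preprocessing described in section 3.3.2 can be sketched as follows. This is a minimal illustration rather than the thesis's own appendix code: it assumes the data is held in a pandas DataFrame with the column names listed in that section, and the function name preprocess and the toy rows are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Column names follow the lists in section 3.3.2.
ORDINAL_COLS = ["Coverage", "Education", "Location Code", "Vehicle Size"]
NOMINAL_COLS = ["State", "EmploymentStatus", "Gender", "Marital Status",
                "Policy Type", "Policy", "Renew Offer Type", "Sales Channel",
                "Vehicle Class"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Return a numeric copy of df (hypothetical helper, not from the thesis)."""
    # Drop the two columns that are not used by the model.
    out = df.drop(columns=["Customer", "Effective To Date"], errors="ignore")
    # Binary target: "No" -> 0, "Yes" -> 1.
    out["Response"] = out["Response"].map({"No": 0, "Yes": 1})
    for col in ORDINAL_COLS:
        if col in out.columns:
            # LabelEncoder assigns integer codes in sorted label order.
            out[col] = LabelEncoder().fit_transform(out[col])
    # One-hot encode the remaining categorical columns.
    nominal = [c for c in NOMINAL_COLS if c in out.columns]
    return pd.get_dummies(out, columns=nominal)


# Tiny made-up frame for illustration only (not the thesis data).
toy = pd.DataFrame({
    "Customer": ["A1", "B2", "C3"],
    "Effective To Date": ["1/1/11", "2/1/11", "3/1/11"],
    "Response": ["No", "Yes", "No"],
    "Coverage": ["Basic", "Premium", "Extended"],
    "Gender": ["F", "M", "F"],
})
clean = preprocess(toy)
print(clean)
```

Note that LabelEncoder numbers labels in sorted (alphabetical) order, so the integer codes need not follow the intended ranking of an ordinal column; an explicit mapping could be used where the ranking matters.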

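The evaluation procedure of section 3.3.3 (80/20 split, oversampling, cross-validated accuracy and F1 for random forest and MLP) can be sketched as below. This is a hedged sketch under several assumptions, not the thesis's own code: synthetic data from make_classification stands in for the preprocessed dataset, the fold count cv=5 is assumed, the hidden layer setting "5,10" is read as hidden_layer_sizes=(5, 10), and the helper oversample is hypothetical and implements plain random oversampling with sklearn.utils.resample.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the preprocessed dataset (assumption).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           weights=[0.85, 0.15], random_state=0)

# 80% / 20% split as described in section 3.3.3.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)


def oversample(X, y, random_state=0):
    """Repeat random minority rows (class 1 assumed) until classes are equal."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = resample(minority, replace=True,
                     n_samples=len(majority) - len(minority),
                     random_state=random_state)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]


X_bal, y_bal = oversample(X_train, y_train)

models = {
    "random forest": RandomForestClassifier(
        n_estimators=20, max_features=15, criterion="entropy", random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(5, 10), activation="relu",
                         max_iter=500, random_state=0),
}

for name, model in models.items():
    # cv=5 is an assumed fold count.
    scores = cross_validate(model, X_bal, y_bal, cv=5,
                            scoring=("accuracy", "f1"))
    f1, acc = scores["test_f1"], scores["test_accuracy"]
    print(f"{name}: F1 = {f1.mean():.3f} (±{f1.std():.3f}), "
          f"Acc = {acc.mean():.3f} (±{acc.std():.3f})")

# Check the random forest on the untouched 20% test set.
final = models["random forest"].fit(X_bal, y_bal)
print("held-out F1:", f1_score(y_test, final.predict(X_test)))
```

Because random oversampling duplicates rows, copies of the same sample can land in both the training and validation folds, so the cross-validated scores in this sketch may be optimistic; the held-out test set, which is never resampled, gives a more conservative check.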