Outlier Detection And Class Imbalance Based On Mass Estimation Integrated With Evidential Reasoning




Doctoral Dissertation

OUTLIER DETECTION AND CLASS IMBALANCE BASED ON MASS ESTIMATION INTEGRATED WITH EVIDENTIAL REASONING

HOANG, Anh

Supervisor: Professor HUYNH, Van-Nam

Graduate School of Advanced Science and Technology
Japan Advanced Institute of Science and Technology
(Knowledge Science)

September 2021

Abstract

Outlier detection and class imbalance modeling play significant roles in enabling effective and efficient algorithms for statistical analysis, data mining, machine learning, and knowledge discovery frameworks working on imbalanced datasets. Although there is a vast literature on imbalanced datasets, the shortcomings of distance-based functions in response to varying densities of data points have not yet been solved. The primary aim of this dissertation was to explore a new alternative approach for local outlier detection by fundamentally changing the way the outlier degree of each data point is measured. To achieve this goal, we developed a mass-based approach to measure the dissimilarity between data points. Then, we introduced a new outlier scoring method that employs mass-based dissimilarity and probability modeling to detect the local outliers in a given dataset. The experimental study on artificial datasets and real application datasets shows that our proposed MLOS approach is competitive with the state-of-the-art approaches.

In the same manner, to exploit the mass-based measurement for learning from imbalanced datasets, we introduce two further new methods for the class imbalance task. The first model is a simple application of a weighted sum. The second model is an integration of mass estimation and the Dempster-Shafer theory of evidence. These proposed models were assessed using evaluation metrics such as F1 score, Brier score, ROC-AUC, and PR-AUC score on a wide range of benchmark datasets. In addition, all experimental results were validated using the non-parametric Wilcoxon signed-ranks test.

To our knowledge, this dissertation was the first study to investigate the local outlier detection problem using mass-based dissimilarity measurement; the key finding was that the proposed MLOS approach presents an alternative way to score the outlierness of each data point in a given dataset. Secondly, the simulation results showed that our proposed new models for the class imbalance task outperformed the other 11 competitive methods. The experiments were conducted on a wide variety of application domains, imbalance ratios, and numbers of instances.

Keywords: Imbalanced data, outlier detection, outlier modeling, mass-based dissimilarity, weighted sum, Dempster-Shafer theory

Acknowledgment

Writing the acknowledgment is always the nicest part! First, I would like to mention the International Cooperation Department, Ministry of Education and Training, Viet Nam, for providing me with a scholarship as part of Project 911. I could not have started my Doctor of Philosophy program at Japan Advanced Institute of Science and Technology (JAIST) without this financial support.

Special thanks to my supervisor, Professor HUYNH Van-Nam, for all he has done, which I will never forget. I truly appreciate the time he spent helping me on many occasions with exceptional support. Thanks for sharing knowledge not only about doing scientific research but also about living a happy life. Besides, I received generous encouragement and assistance from the members of HUYNH's Lab in this work, especially Mr. Toan and Mr. Vinh.

I would like to express my thankfulness to Professor HASHIMOTO Takashi for his wonderful course, Introduction to Knowledge Science (K218). I enjoyed every minute of the lectures as well as the discussions during office hours. Professor DAM Hieu-Chi, thank you so much for caring about both what and how you have been teaching us. I have learned a lot of fundamental concepts and methodologies for doing data science from your courses. In addition, I would particularly like to mention the enjoyable time playing soccer together.

My sincere thanks go to the JAIST Supercomputer Unit for keeping software and services running smoothly so that I could conduct the experimental studies. Thanks to the Student Welfare Section for supporting my life at JAIST. Thanks to the Educational Service Section, the Secretarial Service Section, and other sections at JAIST for their unconditional help. Last and most of all, I am grateful to the committee members and the audience, whose questions and comments will help a lot to improve my work.

Finally, a lot of people have supported me, and I relish this opportunity to thank them. Thanks to the members of JAIST's Football Club, who leave everything behind and enjoy playing sport together. Thanks to my colleagues and friends, who often ask me about my health and my progress. Especially, my parents always believe in me. My spouse and my son, thank you so much for being part of my life.

Contents

Abstract
Acknowledgment
Contents
List of Figures
List of Tables
List of Abbreviations

Chapter 1 Introduction
  1.1 Research motivations
    1.1.1 Outlier detection
    1.1.2 Class imbalance
  1.2 Research questions and contributions
    1.2.1 Research questions
    1.2.2 Main contributions
    1.2.3 Future directions
  1.3 Dissertation organization

Chapter 2 Research background
  2.1 Hierarchical partitioning method
  2.2 Mass-based dissimilarity measurement
    2.2.1 Definition
    2.2.2 Definition
    2.2.3 k-lowest mass-based dissimilarity neighbors
  2.3 Dempster-Shafer theory
  2.4 Evaluation metrics
  2.5 Non-parametric statistical analysis

Chapter 3 Outlier detection
  3.1 Introduction
  3.2 Problem formulation
  3.3 Literature review
    3.3.1 Geometric outlier modeling
    3.3.2 Semi-supervised outlier modeling
  3.4 Proposed MLOS approach
    3.4.1 Notations
    3.4.2 Stage 1: Data preparation
    3.4.3 Stage 2: Data partitioning technique
    3.4.4 Stage 3: Outlier scoring
  3.5 Experimental result
    3.5.1 Experimental results on synthetic datasets
    3.5.2 Experimental results on benchmark datasets
    3.5.3 Non-parametric statistical test
  3.6 Chapter conclusions

Chapter 4 Class imbalance
  4.1 Introduction
  4.2 Class imbalance statement
  4.3 Methodology
    4.3.1 Confidence estimation
    4.3.2 Mass-based similarity measurement
    4.3.3 Mass-based similarity weighted k-neighbor: Sk-LMN approach
    4.3.4 Mass-based similarity integrated with evidential reasoning: EMass approach
  4.4 Experimental studies
    4.4.1 Dataset description
    4.4.2 Implementation details and evaluation metrics
    4.4.3 Results and discussions
  4.5 Chapter conclusions

Chapter 5 Summary and future works

Publications
References
Appendix A
Appendix B

Date posted: 29/10/2022, 01:13
