OpenK: Data Cleansing System - A Clustering-Based Approach for Detecting Data Anomalies


Posted: 12/05/2022, 12:34

VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

GRADUATION THESIS

OPENK: DATA CLEANSING SYSTEM - A CLUSTERING-BASED APPROACH FOR DETECTING DATA ANOMALIES

Council: Information System
Instructor: Assoc. Prof. Dr. Dang Tran Khanh
Reviewer: Dr. Phan Trong Nhan
Student: Nguyen Dinh Khuong - 1752306

Ho Chi Minh City, August 2021

Faculty: Computer Science and Engineering
Student ID: 1752306
Major: Computer Science
Thesis title: OPENK: DATA CLEANSING SYSTEM - A CLUSTERING-BASED APPROACH FOR DETECTING DATA ANOMALIES

Tasks:
- Learn the requirements, analysis, design, and implementation of a data cleansing system running on a web application platform.
- Research and apply edit-based similarity algorithms, using knowledge and methodologies from Algorithm Design and Analysis, Database Management Systems, Clustering Methods, and Web Development, to provide a reasonable and optimized approach to detecting and clustering anomalous data so that it is ready for the next steps.
- Read scientific papers and propose a solution to prevent inconsistent and duplicate data based on clustering methods.
- Research related work, namely other data cleansing systems such as GoogleRefine, BigDansing, and NADEEF, and assess and compare the advantages and disadvantages of the current system. After that, develop further functions and optimize system performance.
- Apply K-NN methods (LD, Damerau LD, Hamming), Similarity methods (Jaro, Jaro-Winkler), and Key Collision methods (Fingerprint, N-gram Fingerprint) for detecting and clustering.
- Test and evaluate the proposed system.

Dates: 02/02/2021, 26/07/2021
Part supervised: the whole thesis

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
10 August 2021

THESIS DEFENSE EVALUATION FORM (for the supervisor)

Student: Nguyễn Đình Khương
Student ID: 1752306
Major: Computer Science
Topic: OPENK: DATA CLEANSING SYSTEM - A CLUSTERING-BASED APPROACH FOR DETECTING DATA ANOMALIES
Supervisor: Assoc. Prof. Dr. Đặng Trần Khánh

Overview of the report: Pages: Chapters: Data tables: Figures: References:
Software: Windows, Python, …
Artifacts (products):
Overview of drawings: Number of drawings: A1: A2: Other sizes: Hand-drawn: Computer-drawn:

Strengths of the thesis:
Developed a cleansing tool for improving (big) data quality in order to achieve high utility in businesses. Moreover, the student finished the following:
- Studying the Pandas, NumPy, and JSON Python libraries, and other relevant programming tools
- Investigating algorithms for measuring text similarity using different methods
- Studying data cleansing and validation tools such as OpenRefine and Cerberus
- Reading scientific papers and proposing a solution to prevent inconsistent and duplicate data based on the clustering method
- Building a visualization method that gives users a better view of the collected data
- Building an API-based library for the developer community

Weaknesses of the thesis:
The thesis presentation can be improved.

Recommendation: Approved for defense □ Additions required before defense □ Not approved for defense □

Questions the student must answer before the council:
a. Point out functionality in which OpenK is better than known existing work or systems.

Overall assessment (in words: excellent/very good, good, average): Excellent. Score: 10/10
Signature (full name): Assoc. Prof. Dr. Đặng Trần Khánh

FACULTY OF COMPUTER SCIENCE AND ENGINEERING
3 August 2021

Student: Nguyễn Đình Khương
Student ID: 1752306
Major: Computer Science
Topic: OpenK: Data Cleansing System - A Clustering-based Approach for Detecting Data Anomalies
Reviewer: Dr. Phan Trong Nhan

Reviewer's comments:
- The student has developed a web application that helps users recognize data anomalies through a clustering-based approach with some built-in methods.
- The student has employed modern development technologies such as Flask, Jinja, pandas, NumPy, HTML, CSS, and JavaScript, and performed some basic experiments (loading time, error, running time).
- The system can connect to files and to cloud-based database management systems.
- Identifying data anomalies with a clustering approach does not really reveal the anomalies. For example, it flags two merely different texts as abnormal.
- The evaluation and comparison are simple and focus on time rather than accuracy. In addition, they do not clearly show how effectively the system helps in anomaly detection.
- The system is inflexible when it comes to adding more methods. Moreover, choosing parameter values is a problem for the user (e.g., the k parameter, the limit on records loaded from the Azure database).

Questions:
a. Would you please show a use case in which a user can benefit from your system?
b. Is there any comparison with related work (e.g., OpenRefine)?
Overall assessment: Good. Score: 9/10
Signature (full name): Phan Trong Nhan

Ho Chi Minh City University of Technology, VNU-HCM
Faculty of Computer Science and Engineering

Acknowledgements

First and foremost, I would like to thank my supervisor, Dr. Dang Tran Khanh, not only for his academic guidance and assistance, but also for his patience and personal support, for which I am truly grateful.

I guarantee that this research is my own, conducted under the supervision and guidance of Dr. Dang Tran Khanh. The result of my research is legitimate and has not been published in any form prior to this. All materials used in this research were collected by me from various sources and are appropriately listed in the references section. In addition, this research uses the results of several other authors and organizations, all of which have been properly referenced. In any case of plagiarism, I stand by my actions and take full responsibility. Ho Chi Minh City University of Technology is therefore not responsible for any copyright infringement in my research.

Abstract

Massive amounts of data are now created every second over the internet, and making efficient decisions from that data has become a critical goal. Even if we had all of the information, extracting valuable knowledge from it would be extremely difficult. The reason is that data is not always clean, or even correct: data obtained from many sources may be redundant, and some records may be duplicated. Such data must be cleaned before it can be used for further processing, and any inconsistencies or duplicates in the datasets should be found by a detection procedure. Windowing, blocking, and machine learning are among the methods used to identify anomalous data.

The goal of this thesis is to offer OpenK, a simple yet efficient data cleansing system based on clustering approaches. Here, a cluster comprises all values that are assumed to be similar, as detected by several techniques: Nearest Neighbor (Levenshtein Distance, Damerau-Levenshtein Distance, Hamming Distance), Similarity Measurement (Jaro Similarity, Jaro-Winkler Similarity), and Key Collision (Fingerprints, N-gram Fingerprints). The tool is evaluated for efficiency and compared with another tool to give a clearer assessment. We used an airlines dataset from https://assets.datacamp.com/production/repositories/5737/datasets and, as a special case study, a real estate dataset crawled from https://batdongsan.com.vn.

OpenK also helps the user load and view data. In addition, CRUD operations, pagination, toggling columns on and off, column sorting, and keyword search are available for analyzing and wrangling input data.

Keywords: Data Cleansing, Levenshtein Distance, Jaro-Winkler Similarity, Fingerprints, Anomaly detection

Contents

1 Introduction
  1.1 Problem Statement
  1.2 Objective
  1.3 Scope
  1.4 Thesis Structure
2 Theoretical Background
  2.1 Related Works
  2.2 Data Anomalies Detection
    2.2.1 Conception
    2.2.2 Existing methods
  2.3 Clustering Methods
    2.3.1 Key Collision
      2.3.1.a Fingerprint
      2.3.1.b N-gram Fingerprint
    2.3.2 Nearest neighbors
      2.3.2.a Hamming distance
      2.3.2.b Levenshtein distance
      2.3.2.c Damerau-Levenshtein distance
      2.3.2.d Jaro Distance - Jaro-Winkler Distance
3 Methodologies And Design
  3.1 General Architecture
    3.1.1 Main components
    3.1.2 Detecting anomaly execution flow
    3.1.3 Use-case of clustering data site
      3.1.3.a Actor determination and following use-case
      3.1.3.b Use-case diagram and specification
  3.2 Existing System and Design
4 System Implementation
  4.1 Technologies and Framework
  4.2 Function implementation
5 System Evaluation
6 Thesis Denouement
  6.1 Achievements
  6.2 Assessment of Thesis Connotation
  6.3 Future Advancement
Appendix A: User Manual

List of Figures

2.1 Example of applying fingerprint algorithm for name
2.2 Formula of Hamming distance calculation
2.3 3-bit binary cube for finding Hamming distance
2.4 Levenshtein Distance calculation formula
2.5 Example of Levenshtein distance calculation table
2.6 Example of Damerau-Levenshtein distance calculation table
2.7 Formula of Jaro similarity calculation
2.8 Jaro-Winkler similarity calculation example
2.9 Comparison of barcode correction using different techniques
3.1 Overall architecture of OpenK system
3.2 Data type format converter
3.3 Data cleansing component illustration
3.4 Clustering Operations illustration
3.5 Activity diagram of OpenK system
3.6 Use case diagram of Data site of OpenK system
3.7 Use case specification of viewing data
3.8 Use case specification of paging data
3.9 Use case specification of searching data keywords
3.10 Use case specification of sorting data column
3.11 Use case specification of export data
3.12 Use case specification of Hiding column data
3.13 Use case specification of Manage data cluster
3.14 Use case specification of cluster data using KNN method
3.15 Use case specification of cluster data using similarity method
3.16 Use case specification of cluster data using key collision method
3.17 Overall architecture of BigDansing
3.18 Overall architecture of NADEEF
4.1 Relation diagram of OpenK routing system
4.2 Flow chart diagram of Upload function implementation
4.3 Flow chart diagram of Data function implementation
4.4 Class diagram of clustering method
4.5 Flow of clustering data with KNN class
4.6 Flow of clustering data with Similarity class
4.7 Implementation code of clustering data with Fingerprint algorithm
4.8 Flow of clustering data with Fingerprint algorithm
5.1 Time performance for loading & visualizing input dataset of OpenK and OpenRefine
5.2 Time performance for detecting & clustering input dataset of OpenK and OpenRefine
5.3 Error percentage for detecting & clustering input dataset of OpenK and OpenRefine

User manual (screen annotations)

CONVERTER
- Upload a valid file type.
- Choose the format to convert into, then click Submit.

IMPORT
- Upload a valid file type, then click Submit.
- For an Azure Cosmos database, fill in all the requested values, then click Submit.

DATA
- After data is uploaded, it is shown in a table view, together with some basic information such as the number of columns and records.
- One control clusters the data; another exports the current data.
- Select the number of records shown at once.
- Select a column to hide, search for a keyword, or click a column to sort it.
- The paging view depends on the number of records shown at once. Clicking Cluster redirects to the clustering function.

CLUSTERING DATA
- Choose a group to begin clustering, then click to run the cluster operation.

SAVE CHANGES AND FINISH
- Click to save.

That is all for the introduction to the manual of the OpenK data cleansing system. Thank you. If you have any questions, send them to my email: khuong.nguyenbkk17@hcmut.edu.vn

...
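The clustering step groups values that are likely variants of one another. As a rough illustration of the nearest-neighbor family named in the abstract, the sketch below computes the Levenshtein distance with the standard dynamic-programming recurrence and then groups values that lie within a chosen radius of any existing cluster member. This is not the thesis implementation: the function names, the `radius` parameter, and the single-link grouping rule are assumptions made for this example, and the sample place names are invented.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def cluster_by_radius(values: List[str], radius: int = 2) -> List[List[str]]:
    """Hypothetical single-link grouping: a value joins every cluster that has a
    member within `radius` edits, merging those clusters into one."""
    clusters: List[List[str]] = []
    for value in dict.fromkeys(values):  # drop exact duplicates, keep order
        near, far = [], []
        for cluster in clusters:
            is_near = any(levenshtein(value, member) <= radius for member in cluster)
            (near if is_near else far).append(cluster)
        merged = [member for cluster in near for member in cluster] + [value]
        clusters = far + [merged]
    # Only groups with more than one member are candidate anomalies.
    return [cluster for cluster in clusters if len(cluster) > 1]


if __name__ == "__main__":
    cities = ["Ho Chi Minh", "Ho Chi Mihn", "Ho Chi Min", "Hanoi", "Ha Noi"]
    for group in cluster_by_radius(cities, radius=2):
        print(group)
```

The pairwise comparison is quadratic in the number of distinct values, so a real system would normally restrict comparisons with some form of blocking first. The radius also controls how aggressive the grouping is, which is the kind of parameter choice the reviewer's comments raise.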
... achieve higher quality repairs. Metadata management and data quality dashboard: metadata management keeps full lineage information about data changes and the order of changes, as well as maintaining...

... are: Pandas allows us to analyze big data and draw conclusions based on statistical theories. Pandas can clean messy data sets and make them readable and relevant; relevant data is very important...

... UI and interaction. PANDAS - The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". It is a software library written for the Python programming language for data manipulation
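To make the Key Collision idea from the abstract concrete, here is a minimal sketch of fingerprint-based clustering. It follows the commonly described fingerprint recipe (trim, lowercase, fold accents to ASCII, drop punctuation, then sort and de-duplicate the tokens), which may differ in detail from the thesis code. The function names and sample values are assumptions for illustration only.

```python
import re
import unicodedata
from typing import Dict, Iterable, List, Optional


def fingerprint(value: str) -> str:
    """Build a normalized key so that superficial variants of a value collide."""
    # Trim and lowercase, then split accented letters into base letter + mark.
    text = unicodedata.normalize("NFKD", value.strip().lower())
    # Drop everything that is not plain ASCII (this removes the accent marks).
    text = text.encode("ascii", "ignore").decode("ascii")
    # Remove punctuation, keeping word characters and whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Sort and de-duplicate tokens so that word order does not matter.
    return " ".join(sorted(set(text.split())))


def key_collision_clusters(values: Iterable[Optional[str]]) -> List[List[str]]:
    """Group distinct values whose fingerprints are identical."""
    buckets: Dict[str, List[str]] = {}
    for value in values:
        if value is None:  # skip missing entries
            continue
        members = buckets.setdefault(fingerprint(value), [])
        if value not in members:
            members.append(value)
    # A bucket with a single member has no colliding variant to report.
    return [members for members in buckets.values() if len(members) > 1]


if __name__ == "__main__":
    names = ["Tom Cruise", "Cruise, Tom", "tom cruise ", "Brad Pitt", None]
    print(key_collision_clusters(names))
```

Because the function accepts any iterable of strings, a pandas column could be passed in directly, for example `df["name"].dropna()`, where `df` and the `name` column are hypothetical. Key collision needs only one pass over the data, which makes it much cheaper than pairwise distance methods, but it will not catch misspellings that change a token.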