Nghiên cứu xây dựng một số giải pháp đảm bảo an toàn thông tin trong quá trình khai phá dữ liệu

THÔNG TIN TÀI LIỆU

Thông tin cơ bản

Định dạng
Số trang	130
Dung lượng	593,48 KB

Nội dung

Luận văn

, mn B GIÁO DC VÀ ÀO TO B QUC PHÒNG VIN KHOA HC VÀ CÔNG NGH QUÂN S ---------------e ̌f--------------- LNG TH DNG DISTRIBUTED SOLUTIONS IN PRIVACY PRESERVING DATA MINING (Nghiên cu xây dng mt s gii pháp đm bo an toàn thông tin trong quá trình khai phá d liu) LUN ÁN TIN S TOÁN HC Hà N i - 2011 B GIÁO DC VÀ ÀO TO B QUC PHÒNG VIN KHOA HC VÀ CÔNG NGH QUÂN S ---------------ěf--------------- LNG TH DNG DISTRIBUTED SOLUTIONS IN PRIVACY PRESERVING DATA MINING (Nghiên cu xây dng mt s gii pháp đm bo an toàn thông tin trong quá trình khai phá d liu) Chuyên ngành: Bo đm toán hc cho máy tính và h thng tính toán. Mã s : 62 46 35 01 LUN ÁN TIN S TOÁN HC Ngi hng dn khoa hc: 1. GIÁO S - TIN S KHOA HC H TÚ BO 2. PHÓ GIÁO S - TIN S BCH NHT HNG Hà Ni - 2011 Pledge I promise that this thesis is a presentation of my original research work. Any of the content was written based on the reliable references such as published papers in distinguished international conferences and journals, and books published by widely-known publishers. Results and discussions of the thesis are new, not previously published by any other authors. i Contents 1 INTRODUCTION 1 1.1 Privacy-preserving data mining: An overview . . . . . . . . . 1 1.2 Objectives and contributions . . . . . . . . . . . . . . . . . . 5 1.3 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.4 Organization of thesis . . . . . . . . . . . . . . . . . . . . . . 12 2 METHODS FOR SECURE MULTI-PARTY COMPUTATION 13 2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.1.1 Computational indistinguishability . . . . . . . . . . . 13 2.1.2 Secure multi-party computation . . . . . . . . . . . . . 14 2.2 Secure computation . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.1 Secret sharing . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.2 Secure sum computation . . . . . . . . . . . . . . . . . 16 2.2.3 Probabilistic public key cryptosystems . . . . . . . . . 17 2.2.4 Variant ElGamal Cryptosystem . . . . . . . . . . . . . 18 2.2.5 Oblivious polynomial evaluation . . . . . . . . . . . . 20 2.2.6 Secure scalar product computation . . . . . . . . . . . 21 2.2.7 Privately computing ln x . . . . . . . . . . . . . . . . . 22 3 PRIVACY PRESERVING FREQUENCY-BASED LEARNING IN 2PFD SETTING 24 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2 Privacy preserving frequency mining in 2PFD setting . . . . . 27 3.2.1 Problem formulation . . . . . . . . . . . . . . . . . . 27 3.2.2 Definition of privacy . . . . . . . . . . . . . . . . . . . 29 3.2.3 Frequency mining protocol . . . . . . . . . . . . . . . 30 ii 3.2.4 Correctness Analysis . . . . . . . . . . . . . . . . . . . 32 3.2.5 Privacy Analysis . . . . . . . . . . . . . . . . . . . . . 34 3.2.6 Efficiency of frequency mining protocol . . . . . . . . 37 3.3 Privacy Preserving Frequency-based Learning in 2PFD Setting 38 3.3.1 Naive Bayes learning problem in 2PFD setting . . . . 38 3.3.2 Naive Bayes learning Protocol . . . . . . . . . . . . . . 40 3.3.3 Correctness and privacy analysis . . . . . . . . . . . . 42 3.3.4 Efficiency of naive Bayes learning protocol . . . . . . . 42 3.4 An improvement of frequency mining protocol . . . . . . . . . 44 3.4.1 Improved frequency mining protocol . . . . . . . . . . 44 3.4.2 Protocol Analysis . . . . . . . . . . . . . . . . . . . . . 45 3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4 ENHANCING PRIVACY FOR FREQUENT ITEMSET MINING IN VERTICALLY . 49 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . 51 4.2.1 Association rules and frequent itemset . . . . . . . . . 51 4.2.2 Frequent itmeset identifying in vertically distributed data 52 4.3 Computational and privacy model . . . . . . . . . . . . . . . 53 4.4 Support count preserving protocol . . . . . . . . . . . . . . . 54 4.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.4.2 Protocol design . . . . . . . . . . . . . . . . . . . . . . 56 4.4.3 Correctness Analysis . . . . . . . . . . . . . . . . . . . 57 4.4.4 Privacy Analysis . . . . . . . . . . . . . . . . . . . . . 59 4.4.5 Performance analysis . . . . . . . . . . . . . . . . . . . 61 4.5 Support count computation-based protocol . . . . . . . . . . 64 4.5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.5.2 Protocol Design . . . . . . . . . . . . . . . . . . . . . . 65 4.5.3 Correctness Analysis . . . . . . . . . . . . . . . . . . . 65 4.5.4 Privacy Analysis . . . . . . . . . . . . . . . . . . . . . 67 4.5.5 Performance analysis . . . . . . . . . . . . . . . . . . . 68 4.6 Using binary tree communication structure . . . . . . . . . . 69 iii 4.7 Privacy-preserving distributed Apriori algorithm . . . . . . . 70 4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5 PRIVACY PRESERVING CLUSTERING 73 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.2 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . 74 5.3 Privacy preserving clustering for the multi-party distributed data 76 5.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 76 5.3.2 Private multi-party mean computation . . . . . . . . . 78 5.3.3 Privacy preserving multi-party clustering protocol . . 80 5.4 Privacy preserving clustering without disclosing cluster centers 82 5.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.4.2 Privacy preserving two-party clustering protocol . . . 85 5.4.3 Secure mean sharing . . . . . . . . . . . . . . . . . . . 87 5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 6 PRIVACY PRESERVING OUTLIER DETECTION 91 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 6.2 Technical preliminaries . . . . . . . . . . . . . . . . . . . . . . 92 6.2.1 Problem statement . . . . . . . . . . . . . . . . . . . . 92 6.2.2 Linear transformation . . . . . . . . . . . . . . . . . . 93 6.2.3 Privacy model . . . . . . . . . . . . . . . . . . . . . . 94 6.2.4 Private matrix product sharing . . . . . . . . . . . . . 95 6.3 Protocols for the horizontally distributed data . . . . . . . . . 95 6.3.1 Two-party protocol . . . . . . . . . . . . . . . . . . . . 97 6.3.2 Multi-party protocol . . . . . . . . . . . . . . . . . . . 100 6.4 Protocol for two-party vertically distributed data . . . . . . . 101 6.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 SUMMARY 107 Publication List 110 Bibliography 111 iv List of Phrases Abbreviation Full name PPDM Privacy Preserving Data Mining k-NN k-nearest neighbor EM Expectation-maximization SMC Secure Multiparty Computation DDH Decisional Diffie-Hellman PMPS Private Matrices Product Sharing SSP Secure Scalar Product OPE Oblivious polynomial evaluation ICA Independent Component Analysis 2PFD 2-part fully distributed setting FD fully distributed setting c ≡ computational indistinguishability v List of Tables 4.1 The communication cost . . . . . . . . . . . . . . . . . . . . 62 4.2 The complexity of the support count preserving protocol . . . 63 4.3 The parties’s time for the support count preserving protocol 64 4.4 The communication cost . . . . . . . . . . . . . . . . . . . . 68 4.5 The complexity of the support count computation protocol . 69 4.6 The parties’s time for the support count computation protocol 70 6.1 The parties’s computational time for the horizontally distributed data 105 6.2 The parties’s computational time for the vertically distributed data 105 vi List of Figures 3.1 Frequency mining protocol . . . . . . . . . . . . . . . . . . . . 33 3.2 The time used by the miner for computing the frequency f . 38 3.3 Privacy preserving protocol of naive Bayes learning . . . . . . 41 3.4 The computational time for the first phase and the third phrase 43 3.5 The time for computing the key values in the first phase . . 43 3.6 The time for computing the frequency f in third phrase . . . 44 3.7 Improved frequency mining protocol . . . . . . . . . . . . . . 47 4.1 Support count preserving protocol. . . . . . . . . . . . . . . . 58 4.2 The support count computation protocol. . . . . . . . . . . . 66 4.3 Privacy-preserving distributed Apriori protocol . . . . . . . . 72 5.1 Privacy preserving multi-party mean computation . . . . . . 79 5.2 Privacy preserving multi-party clustering protocol . . . . . . 81 5.3 Privacy preserving two-party clustering . . . . . . . . . . . . 86 5.4 Secure mean sharing . . . . . . . . . . . . . . . . . . . . . . . 89 6.1 Private matrix product sharing (PMPS). . . . . . . . . . . . . 96 6.2 Protocol for two-party horizontally distributed data. . . . . . 98 6.3 Protocol for multi-party horizontally distributed data. . . . . 101 6.4 Protocol for two-party vertically distributed data. . . . . . . . 103 vii Chapter 1 INTRODUCTION 1.1. Privacy-preserving data mining: An overview Data mining plays an important role in the current world and provides us a powerful tool to efficiently discover valuable information from large databases [25]. However, the process of mining data can result in a viola- tion of privacy, therefore, issues of privacy preservation in data mining are receiving more and more attention from the this community [52]. As a result, there are a large number of studies has been produced on the topic of privacy-preserving data mining (PPDM) [72]. These studies deal with the problem of learning data mining models from the databases, while protecting data privacy at the level of individual records or the level of organizations. Basically, there are three major problems in PPDM [8]. First, the organizations such as government agencies wish to publish their data for researchers and even community. However, they want to preserve the data privacy, for example, highly sensitive financial and health private data. Second, a group of the organizations (or parties) wishes to together obtain the mining result on their joint data without disclosing each party’s privacy information. Third, a miner wishes to collect data or obtain the data mining models from the individual users, while preserving privacy of each user. Consequently, PPDM can be formed into three following areas depending on the models of information sharing. Privacy-preserving data publishing: The model of this research con- sists of only an organization, is the trusted data holder. This organization wishes to publish its data to the miner or the research community such that the anonymized data are useful for the data mining applications. For example, some hospitals collect records from their patients for the some required 1

Ngày đăng: 04/12/2013, 13:56

Xem thêm