
Privacy preserving in sharing data



NHAM LAP HANH

PRIVACY PRESERVING IN SHARING DATA

Major: COMPUTER SCIENCE

Code: 8480101


Supervisor: Dr. TRUONG TUAN ANH

Examiner 1: Assoc. Prof. Dr. NGUYEN TUAN DANG
Examiner 2: Assoc. Prof. Dr. NGUYEN VAN VU

This master's thesis was defended at HCM City University of Technology, VNU-HCM City, on July 11, 2023.

Master’s Thesis Committee:

1. Chairman: Assoc. Prof. Dr. LE HONG TRANG
2. Secretary: Dr. PHAN TRONG NHAN
3. Examiner 1: Assoc. Prof. Dr. NGUYEN TUAN DANG
4. Examiner 2: Assoc. Prof. Dr. NGUYEN VAN VU
5. Commissioner: Dr. TRUONG TUAN ANH

Approval of the Chairman of the Master's Thesis Committee and the Dean of the Faculty of Computer Science and Engineering after the thesis has been corrected (if any).

CHAIRMAN OF THESIS COMMITTEE

HEAD OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING


VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

THE TASK SHEET OF MASTER’S THESIS

Full name: NHAM LAP HANH Student ID: 1991020

Date of birth: November 23, 1991 Place of birth: HCM City

Major: Computer Science Major ID: 8480101

I. THESIS TITLE: PRIVACY PRESERVING IN SHARING DATA / BẢO VỆ QUYỀN RIÊNG TƯ TRONG DỮ LIỆU CHIA SẺ

II. TASKS AND CONTENTS:

- Researched data anonymization

- Researched and analyzed the strengths and weaknesses of existing data anonymization solutions

- Based on this, proposed an improved algorithm for data anonymization

- Implemented and evaluated the improved algorithm

III. THESIS START DATE: February 01, 2023

IV. THESIS COMPLETION DATE: May 31, 2023

V. SUPERVISOR: Dr. TRUONG TUAN ANH

Ho Chi Minh City, June 09, 2023


ACKNOWLEDGEMENTS

I am deeply grateful to Dr. Truong Tuan Anh for his wholehearted help and guidance during my thesis work. He provided me with valuable insights and feedback, and his support was essential to the completion of my thesis.

I would also like to thank Assoc. Prof. Dr. Nguyen Tuan Dang and Assoc. Prof. Dr. Nguyen Van Vu for their useful comments on how to complete my thesis before defending it. Their advice was invaluable, and I am very grateful for their help.

Besides, on behalf of all students, I would like to express my gratitude to the teachers of Ho Chi Minh City University of Technology, especially those in the Faculty of Computer Science and Engineering. Your enthusiastic guidance and support have been invaluable to us, both during our time at the university and in our later career paths.


ABSTRACT


TÓM TẮT LUẬN VĂN


THE COMMITMENT OF THE THESIS’ AUTHOR:

I hereby declare that this is my own research work.

The data and results presented in the thesis are honest and have never been published in any previous works.


Table of contents

THE TASK SHEET OF MASTER’S THESIS i

ACKNOWLEDGEMENTS ii

ABSTRACT iii

TÓM TẮT LUẬN VĂN iv

THE COMMITMENT OF THE THESIS’ AUTHOR v

CHAPTER 1: INTRODUCTION 1

1.1 General introduction 1

1.2 Research problem 2

1.3 Objectives of the topic 4

1.4 Research significances 4

CHAPTER 2: OVERVIEW 5

2.1 Information Anonymization 5

2.2 Privacy Preserving Data Mining 11

2.3 Related works 14

2.3.1 One-Pass K-Means Algorithm 17

2.3.2 K-Anonymity Algorithm Based On Improved Clustering 18

2.3.3 A clustering-based anonymization approach for privacy-preserving in the healthcare cloud 19

2.3.4 Mondrian 19

2.3.5 NonHomogenous Anonymization Approach Using Association Rule Mining For Preserving Privacy 20

2.3.6 M3AR Algorithm 21

CHAPTER 3: THE PROPOSED ALGORITHM 27

3.1 Definitions 27

3.2 Budgets Calculation 28

3.3 The impact of member migration on association rules 32

3.4 The proposed algorithm 34


Table of Figures

Figure 1.1: Overview of Privacy-Preserving Data Publishing 1

Figure 1.2: Illustration of the anonymize-and-mine process 3

Figure 2.1: An example of domain and value generalization hierarchies 8

Figure 2.2: An example of generating Association Rule using Apriori algorithm 14

Figure 2.3: An example of attacks on k-anonymity 15

Figure 2.4: A classification tree example for a category attribute 16

Figure 2.5: Core process of the Mondrian algorithm 20

Figure 2.6: The Block Diagram Of Nonhomogenous Anonymization Approach 21

Figure 2.7: The M3AR algorithm pseudocode 24

Figure 2.8: Disperse function of M3AR algorithm 25

Figure 4.1: Results on metric Lost Rules Percentage 49

Figure 4.2: Results on metric New Rules Percentage 50

Figure 4.3: Results on metric Different Rules Percentage 51

Figure 4.4: Results on metric Average Group Size 52


Table of Tables

Table 2.1: k-anonymity, types of attributes in a dataset 6

Table 2.2: Microdata table of criminal records 9

Table 2.3: A 3-anonymous version of Table 2.2 10

Table 2.4: Example of Association Rule Mining 12

Table 2.5: Example of 3-anonymity on tuples member migration technique 22


CHAPTER 1: INTRODUCTION

1.1 General introduction

Currently, the volume of data generated increases exponentially every year. This data has brought many benefits to organizations, which store, share, and exploit it using data mining techniques. Through data mining, valuable information can be discovered from shared data, and massive data generated from various sources can be processed and analyzed to support decision-making. However, this data increasingly contains personal information, which leads to serious privacy concerns. Therefore, privacy-preserving data analysis becomes very important.

Besides, governments have published data privacy rules, such as HIPAA (Health Insurance Portability and Accountability Act) in the US and the General Data Protection Regulation (GDPR) in Europe, in order to control the use and sharing of data and protect user privacy. Any organization found to be disclosing user information is subject to severe fines.


who intend to discover sensitive information about individuals. Therefore, the goal of PPDP techniques is to modify data by making it less specific in a way that protects the privacy of individuals, while aiming to maintain the usefulness of the anonymized data. The essence of PPDP is to create datasets that have good utility for various tasks since, typically, all potential use scenarios for the data are unknown at the time of publication. For example, under open data initiatives, it is not possible to identify all data recipients. Therefore, any data controller involved in sharing personal data should apply privacy protection mechanisms.

1.2 Research problem

Many privacy-preserving (PP) models have been proposed to address this situation. These models have been developed to consider different attack scenarios against data; for example, an attacker possessing varying degrees of background knowledge could cause information disclosure. Examples of well-known models are k-anonymity [2], l-diversity [3], t-closeness [4], and differential privacy [17]. Among these models, k-anonymity is focused on and widely used by researchers and organizations because it is realistic and can be easily achieved in most cases. Furthermore, although k-anonymity has been shown to be vulnerable to specific attacks, it allows general-purpose data to be published with reasonable utility. This is in contrast to more robust models (e.g., differential privacy), which might hamper data quality in order to preserve a stricter guarantee of privacy [18]. These characteristics make k-anonymity attractive to practitioners, who can adopt it within their organizations as an anonymization strategy with a formal guarantee of privacy.


Some benefits of information technologies are only possible through the collection and analysis of (sometimes sensitive) data. However, transforming the data may also reduce its utility, resulting in inaccurate or even infeasible extraction of knowledge through data mining. Privacy-Preserving Data Mining (PPDM) methodologies are designed to guarantee a certain level of privacy while maximizing the utility of the data, such that data mining can still be performed efficiently on the transformed data.

Figure 1.2: Illustration of the anonymize-and-mine process

A simple example of data sharing is shown in Figure 1.2: the owner of dataset D holds sensitive data, quasi-identifying data, and identifiers. Data miners require this data for their purposes, i.e., to analyze and mine it using association rule techniques to extract valuable information. To protect the privacy of users with sensitive data, the data owner must apply a PP algorithm (e.g., k-anonymity) to anonymize the original dataset D into dataset D' before sharing it with the data miner. The data miner then uses a mining technique (e.g., association rules) on dataset D' to extract valuable information.


1.3 Objectives of the topic

As mentioned, many studies have been conducted on anonymizing data before releasing it to third parties. Maximizing the utility of data and minimizing privacy risk are two opposing goals. As a result, the goal of this thesis is to investigate the state of algorithms that preserve data privacy while ensuring good data mining results. Finally, this thesis proposes an efficient algorithm for achieving k-anonymity that outperforms current state-of-the-art algorithms. Additionally, the proposed algorithm preserves significant association rules in the k-anonymized data so that data mining based on association rule mining can recover valuable information as in the original data.

1.4 Research significances


CHAPTER 2: OVERVIEW

2.1 Information Anonymization

Information anonymization [15] is the process of removing personally identifiable information from data sets, so that the people whom the data describes remain anonymous. It allows information to be transferred across a boundary, such as between two departments within an agency or between two agencies, while reducing the risk of unintended disclosure, and in certain environments in a manner that enables evaluation and analysis post-anonymization. This technique is used in projects to increase the security of the data while still enabling the data to be analyzed and used. It transforms the data that will be used or published in order to prevent the identification of key information. There are a number of different information anonymization methods, each with its own advantages and disadvantages. Some of the most common methods include:

● k-anonymity [2]: This method ensures that no individual in the dataset can be uniquely identified by a combination of quasi-identifier values. For example, if a dataset contains the ages, genders, and ZIP codes of individuals, k-anonymity would ensure that every combination of age, gender, and ZIP code that appears in the dataset is shared by at least k individuals.

● l-diversity [3]: This method ensures that each group of records sharing the same quasi-identifier values contains at least l distinct values of the sensitive attribute. For example, if a dataset contains the diseases of individuals, l-diversity would ensure that every such group contains at least l different diseases.



K-Anonymity

The k-anonymity [2] model is an approach to protect data from individual identification. It works by ensuring that each record of a table is identical to at least k−1 other records with respect to a set of privacy-related attributes, called quasi-identifiers (Table 2.1 shows the three different types of attributes), that could potentially be used to identify individuals by linking these attributes to external data sets.

Table 2.1: k-anonymity, types of attributes in a dataset

Type of Attributes | Description | Example
Explicit identifiers (ID) | Attributes that explicitly identify people or objects | ID, NAME
Quasi-identifier attributes (QIDs) | A collection of characteristics that can be used to identify people or objects | MARITAL STATUS, AGE, SEX, ZIP CODE
Sensitive attributes (SA) | Sensitive information to be concealed | CRIME, SALARY, DISEASE

This anonymization can be achieved by generalization as well as suppression.


● Generalization is the process of converting a value into a less specific, more general term. Generalization is based on a domain generalization hierarchy and a corresponding value generalization hierarchy on the values in the domains. Typically, the domain generalization hierarchy is a total order and the corresponding value generalization hierarchy a tree, where the parent/child relationship represents the direct generalization/specialization relationship. Figure 2.1 illustrates an example of possible domain and value generalization hierarchies for the quasi-identifying attributes of our example.

○ Attribute (AG): Generalization is performed at the column level; all the values in the column are generalized at each generalization step.

○ Cell (CG): Generalization can also be performed on a single cell; as a result, a generalized table may contain, for a specific column, values at different levels of generalization.

● Suppression consists of protecting sensitive information by removing it. Suppression can be applied at the level of a single cell, an entire tuple, or an entire column, allowing a reduction in the amount of generalization required to achieve k-anonymity.

○ Tuple (TS): Suppression is performed at the row level; the suppression operation removes an entire tuple.

○ Attribute (AS): Suppression is performed at the column level; the suppression operation hides all the values of a column.
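To make attribute-level generalization concrete, the sketch below encodes a single value generalization step for ages and ZIP codes (matching the intervals and masked ZIP prefixes that appear later in Table 2.3) and applies it to whole columns. The function names and the 5-year interval width are illustrative assumptions, not part of the thesis.

```python
# Sketch of attribute-level generalization (AG): every value in a column
# is replaced by its parent in a value generalization hierarchy.

def generalize_age(age: int) -> str:
    """Map an exact age to a 5-year interval, e.g. 29 -> '[25-30)'."""
    low = (age // 5) * 5
    return f"[{low}-{low + 5})"

def generalize_zip(zip_code: str) -> str:
    """Mask the last digit of a ZIP code, e.g. '32042' -> '3204*'."""
    return zip_code[:-1] + "*"

def generalize_column(values, fn):
    # AG applies the same generalization step to all cells of the column.
    return [fn(v) for v in values]

ages = [29, 20, 24, 28, 25, 23]
zips = ["32042", "32021", "32024", "32046", "32045", "32027"]
print(generalize_column(ages, generalize_age))  # ['[25-30)', '[20-25)', ...]
print(generalize_column(zips, generalize_zip))  # ['3204*', '3202*', ...]
```

A domain generalization hierarchy would chain such steps (exact age, then 5-year interval, then 10-year interval, and so on), each step producing a strictly less specific domain.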


Figure 2.1: An example of domain and value generalization hierarchies [1]

For example, consider the criminal records in Table 2.2. Among the attributes, Name is the identifier (ID); Marital Status, Age, and ZIP Code are the quasi-identifiers (QIDs), whose combination may link a record to an exact person, exposing the Name and Crime values; Crime is the sensitive attribute (SA), i.e., a user's private information that we do not want to disclose. For instance, the set of values {Separated, 29, 32042} of {Marital Status, Age, ZIP Code} uniquely identifies a tuple.


Table 2.2: Microdata table of criminal records

Tuple# | Name | Marital Status | Age | ZIP Code | Crime
1 | Joe | Separated | 29 | 32042 | Murder
2 | Jill | Single | 20 | 32021 | Theft
3 | Sue | Widowed | 24 | 32024 | Traffic
4 | Abe | Separated | 28 | 32046 | Assault
5 | Bob | Widowed | 25 | 32045 | Piracy
6 | Amy | Single | 23 | 32027 | Indecency

(Name is the explicit identifier; Marital Status, Age, and ZIP Code are the QIDs; Crime is the SA.)

After k-anonymization, Table 2.3 shows a 3-anonymous version of Table 2.2, where anonymization is achieved via generalization at the attribute level [3], i.e., if two records contain the same value for a QID attribute, they are generalized to the same value for that attribute as well. This means that each tuple shares the same QID values with at least two other tuples. To achieve anonymity, the ID has been removed and the QIDs have been generalized:

● the marital status has been replaced with a less specific but semantically consistent description


Table 2.3: A 3-anonymous version of Table 2.2

Tuple# | EQ | Marital Status | Age | ZIP Code | Crime
1 | 1 | Not Married | [25-30) | 3204* | Murder
4 | 1 | Not Married | [25-30) | 3204* | Assault
5 | 1 | Not Married | [25-30) | 3204* | Piracy
2 | 2 | Not Married | [20-25) | 3202* | Theft
3 | 2 | Not Married | [20-25) | 3202* | Traffic
6 | 2 | Not Married | [20-25) | 3202* | Indecency

(Marital Status, Age, and ZIP Code are the QIDs; Crime is the SA; EQ is the equivalence class.)
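The 3-anonymity of Table 2.3 can be verified mechanically: group the records by their QID values and check that every group contains at least k records. A minimal sketch, with the rows transcribed from Table 2.3:

```python
from collections import Counter

def is_k_anonymous(records, qid_indexes, k):
    """True iff every combination of QID values occurs at least k times."""
    groups = Counter(tuple(r[i] for i in qid_indexes) for r in records)
    return all(count >= k for count in groups.values())

# (Marital Status, Age, ZIP Code, Crime) rows of Table 2.3
table_2_3 = [
    ("Not Married", "[25-30)", "3204*", "Murder"),
    ("Not Married", "[25-30)", "3204*", "Assault"),
    ("Not Married", "[25-30)", "3204*", "Piracy"),
    ("Not Married", "[20-25)", "3202*", "Theft"),
    ("Not Married", "[20-25)", "3202*", "Traffic"),
    ("Not Married", "[20-25)", "3202*", "Indecency"),
]
print(is_k_anonymous(table_2_3, qid_indexes=(0, 1, 2), k=3))  # True
print(is_k_anonymous(table_2_3, qid_indexes=(0, 1, 2), k=4))  # False
```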


2.2 Privacy Preserving Data Mining

With the large amount of data collected, organizations can benefit greatly from extracting knowledge from this available information. The process of extracting patterns (knowledge) from large data sets, which can then be represented and interpreted, is called data mining. Three of the most common approaches in machine learning are association rule mining, classification, and clustering, where classification is a supervised learning technique, while association rule mining and clustering are unsupervised.

Among the machine learning techniques, association rule mining [19] is one of the most popular techniques in data mining. An association rule mining algorithm mines the associations between things and is often used to discover relational knowledge between them. For instance, it can find potential connections between diseases and patient records of habits and gender, which can help doctors discover and anticipate a patient's likelihood of getting sick and prevent it in time.

The algorithm uses a support-confidence framework to evaluate the value of an association rule r: {A, B} → {C}:

● The support: the frequency with which an item or itemset appears in the dataset D.

support(r) = frq(A, B, C) / N

Intuitively, this metric reflects the usefulness of the rule r.

● The confidence: how likely the rule is to be correct, i.e., the fraction of tuples containing the antecedent that also contain the consequent in the dataset D.


Table 2.4: Example of Association Rule Mining

Age | Gender | Zip Code | SmokingHabit | Disease
34 | F | 110092 | Smoker | Flu
58 | M | 110032 | Smoker | Cancer
46 | F | 110156 | Nonsmoker | Cancer
26 | F | 113398 | Smoker | AIDS
18 | F | 113301 | Nonsmoker | None
42 | M | 110045 | Smoker | Flu
50 | M | 110087 | Nonsmoker | Asthma
52 | F | 110076 | Smoker | Cancer
38 | F | 118970 | Nonsmoker | Flu
45 | M | 110045 | Smoker | Cancer

For example: given Table 2.4, we find association rules based on the Age and SmokingHabit columns. Assume the minimum support threshold min_sup = 30% and the minimum confidence threshold min_conf = 30%. A rule is considered significant if its support value is greater than 30% and its confidence value is greater than 30%.

We find the following rule: if a person is above 40 and is a smoker, then he or she is more likely to have cancer.

r1 = {Age ≥ 40, SmokingHabit = Smoker} → {Disease = Cancer}

The calculation of the support and confidence of rule r1:

support(r1) = 3 / 10 = 0.3   (2.1)

confidence(r1) = 3 / 4 = 0.75   (2.2)

As support ≥ 0.3 and confidence ≥ 0.3, the above rule is considered a significant rule.
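The values in Equations (2.1) and (2.2) can be reproduced directly from Table 2.4; the sketch below hard-codes the (Age, SmokingHabit, Disease) columns and counts the tuples matching the antecedent and the consequent of r1:

```python
# Rows of Table 2.4 restricted to the columns used by rule r1.
rows = [
    (34, "Smoker", "Flu"), (58, "Smoker", "Cancer"),
    (46, "Nonsmoker", "Cancer"), (26, "Smoker", "AIDS"),
    (18, "Nonsmoker", "None"), (42, "Smoker", "Flu"),
    (50, "Nonsmoker", "Asthma"), (52, "Smoker", "Cancer"),
    (38, "Nonsmoker", "Flu"), (45, "Smoker", "Cancer"),
]

antecedent = lambda r: r[0] >= 40 and r[1] == "Smoker"
consequent = lambda r: r[2] == "Cancer"

total = len(rows)
frq_a = sum(1 for r in rows if antecedent(r))
frq_ab = sum(1 for r in rows if antecedent(r) and consequent(r))

support = frq_ab / total     # 3 / 10 = 0.3, Equation (2.1)
confidence = frq_ab / frq_a  # 3 / 4 = 0.75, Equation (2.2)
significant = support >= 0.3 and confidence >= 0.3
print(support, confidence, significant)  # 0.3 0.75 True
```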

In fact, not all rules are interesting to mine; association rule mining algorithms only mine significant rules, i.e., rules that meet the minimum support and minimum confidence thresholds. Ideally, the association rule set R' of dataset D' after anonymization should be similar or identical to the association rule set R of the original dataset D. Therefore, the goal of PPDM is to try to uphold these important rules. Several algorithms are used in association rule mining, each with its own strengths and weaknesses.

● Apriori algorithm: One of the most popular association rule mining algorithms is the Apriori algorithm. It is based on the concept of frequent itemsets, i.e., sets of items that occur together frequently in a dataset. The algorithm works by first identifying all the frequent itemsets in a dataset and then generating association rules from those itemsets.

● FP-Growth algorithm: For large datasets, FP-Growth is a popular method for mining frequent itemsets. It generates frequent itemsets efficiently without generating candidate itemsets, using a tree-based data structure called the FP-tree. As a result, it is faster and more memory-efficient than the Apriori algorithm when dealing with large datasets.

● Eclat algorithm : Equivalence Class Transformation, or Eclat is another


Figure 2.2: An example of generating Association Rule using Apriori algorithm
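The frequent-itemset step of Apriori described above can be sketched in a few lines: count 1-itemsets, keep those meeting min_sup, then repeatedly join the survivors into candidates one item larger. The transactions below are illustrative, not taken from the thesis, and the sketch omits Apriori's subset-based candidate pruning for brevity:

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_sup):
    """Return all itemsets whose support (fraction of transactions) >= min_sup."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent, size = {}, 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        # Count candidate occurrences and keep those meeting min_sup.
        level = {c: sum(1 for t in transactions if c <= t) / n for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_sup}
        frequent.update(level)
        size += 1
        # Join step: union pairs of surviving itemsets into (size)-itemsets.
        survivors = list(level)
        candidates = {a | b for a, b in combinations(survivors, 2) if len(a | b) == size}
    return frequent

tx = [frozenset(t) for t in (["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C"])]
freq = apriori_frequent_itemsets(tx, min_sup=0.6)
print(sorted("".join(sorted(s)) for s in freq))  # ['A', 'AB', 'AC', 'B', 'BC', 'C']
```

Rule generation then splits each frequent itemset into antecedent/consequent pairs and keeps the splits whose confidence meets min_conf.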

2.3 Related works

Many PP models have been proposed to address this situation. Examples of well-known models are k-anonymity [2], l-diversity [3], t-closeness [4], and differential privacy [17].


Figure 2.3: An example of attacks on k-anonymity


process by maintaining association rules that are significant to the data mining process; therefore, it gives better results when association rule mining is run.

The trade-off between data privacy and data quality is important in the PP problem. Most k-anonymity approaches have used a variety of metrics as the basis for algorithm operations to minimize information loss, such as Precision [2] and IL [5, 7].

The information loss (IL) metric [5, 7] is defined as follows.

Suppose that for a numeric attribute in a tuple, the original value x is generalized to [x_min, x_max], where x_min is the minimum and x_max is the maximum of the equivalence class, and Max and Min are the maximum and minimum values of the attribute over the whole domain. Then the information loss of the tuple on the numeric attribute is defined as Equation (2.3):

IL = (x_max − x_min) / (Max − Min)   (2.3)

For a categorical attribute, we usually need to build a classification tree first. Figure 2.4 shows a classification tree example for a categorical attribute. Suppose that the value of a tuple is generalized from e to c. Then the information loss of the tuple on the attribute is defined as Equation (2.4):

IL = size(c) / Size   (2.4)

where size(c) is the number of leaf values covered by c and Size is the total number of leaf values of the attribute.


Finally, the average IL, Equation (2.5), of all the tuples is the information loss of the whole data set after generalization:

IL = (Σ_{i=1..m} IL_i) / m   (2.5)

where m is the number of all attributes.
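Equations (2.3) and (2.5) can be sketched for a single numeric attribute as follows; the example reuses the Age column of Tables 2.2 and 2.3 and assumes, for illustration only, that the whole-domain Min and Max are simply the dataset's minimum and maximum ages:

```python
def numeric_il(x_min, x_max, dom_min, dom_max):
    """Equation (2.3): IL of one tuple on a numeric attribute."""
    return (x_max - x_min) / (dom_max - dom_min)

ages = [29, 20, 24, 28, 25, 23]          # Age column of Table 2.2
dom_min, dom_max = min(ages), max(ages)  # assumed whole-domain bounds: 20..29

# After 3-anonymization, ages 29, 28, 25 fall in EQ1 and 20, 24, 23 in EQ2,
# so every tuple's age is generalized to its class's [min, max] range.
eq1, eq2 = [29, 28, 25], [20, 24, 23]
ils = [numeric_il(min(eq), max(eq), dom_min, dom_max)
       for eq in (eq1, eq2) for _ in eq]
avg_il = sum(ils) / len(ils)             # per-tuple average on this attribute
print(round(avg_il, 3))                  # 0.444
```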

Another well-known metric is the Discernibility Metric (DM). This metric measures how indistinguishable a record is from others by assigning a penalty to each record equal to the size of the equivalence class (EQ) to which it belongs [25]. If a record is suppressed, it is assigned a penalty equal to the size of the input dataset. The overall DM score for a k-anonymized dataset D' is defined by:

DM = Σ_{∀EQ s.t. |EQ| ≥ k} |EQ|² + Σ_{∀EQ s.t. |EQ| < k} total · |EQ|   (2.6)

where total is the number of records and |EQ| is the size of an equivalence class (anonymized group) created by the anonymization. The idea behind this metric is that larger EQs represent more information loss; thus, lower values of this metric are desirable.
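Equation (2.6) can be evaluated directly from the equivalence-class sizes. A minimal sketch, using the two classes of size 3 produced in Table 2.3 (the second call, with a class of size 2, is a hypothetical illustration of the suppression penalty):

```python
def discernibility_metric(eq_sizes, k, total):
    """Equation (2.6): |EQ|^2 for classes of size >= k, total * |EQ| otherwise."""
    return sum(s * s if s >= k else total * s for s in eq_sizes)

# Table 2.3 yields two equivalence classes of size 3 out of 6 records.
print(discernibility_metric([3, 3], k=3, total=6))  # 3^2 + 3^2 = 18

# If one class of size 2 had fallen below k, its records would be suppressed:
print(discernibility_metric([4, 2], k=3, total=6))  # 4^2 + 6*2 = 28
```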

2.3.1 One-Pass K-Means Algorithm

Wei and Lin [20] proposed a two-level algorithm called One-pass K-means Algorithm (OKA). In the first level, it clusters the data using k-means for one iteration; in the second level, it checks the clusters that have more than k records and distributes records from them to other clusters that have fewer than k records.


● Clustering Stage: The algorithm randomly picks K records as the seeds to build K clusters. Then, for each record r ∈ D, the algorithm finds the cluster that is closest to r, assigns r to this cluster, and subsequently updates the centroid of this cluster.

● Adjustment Stage: After the clustering stage, a set of clusters has been constructed, but some of these clusters might contain fewer than k records. Therefore, further adjustment is needed to satisfy the k-anonymity constraint. The goal of the adjustment stage is to adjust the set of clusters constructed in the clustering stage such that every cluster contains at least k records. Such an adjustment should also try to minimize the total information loss.
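A compact sketch of the two OKA stages on a single numeric attribute: seeding, one assignment pass with centroid updates, then moving records from oversized clusters into undersized ones. This is a simplification under the assumption of 1-D records and absolute-difference distance; the original algorithm works on multi-attribute tuples with an information-loss-aware adjustment.

```python
import random

def oka_sketch(values, K, k, seed=0):
    """One-pass k-means (clustering stage) followed by the adjustment stage."""
    rng = random.Random(seed)
    centroids = rng.sample(values, K)
    clusters = [[] for _ in range(K)]
    # Clustering stage: single pass, assign to nearest centroid, update it.
    for v in values:
        i = min(range(K), key=lambda j: abs(v - centroids[j]))
        clusters[i].append(v)
        centroids[i] = sum(clusters[i]) / len(clusters[i])
    # Adjustment stage: move records from clusters with more than k records
    # into clusters with fewer than k records.
    for small in clusters:
        while len(small) < k:
            donors = [c for c in clusters if len(c) > k]
            if not donors:
                break
            if small:
                target = sum(small) / len(small)
                best = min(donors, key=lambda c: min(abs(v - target) for v in c))
                moved = min(best, key=lambda v: abs(v - target))
            else:
                best = max(donors, key=len)
                moved = best[0]
            best.remove(moved)
            small.append(moved)
    return clusters

clusters = oka_sketch([29, 20, 24, 28, 25, 23], K=2, k=3)
print([sorted(c) for c in clusters])  # every non-empty cluster has >= 3 records
```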

2.3.2 K-Anonymity Algorithm Based On Improved Clustering

Zheng et al. [22] have proposed an improved algorithm to achieve k-anonymity using clustering techniques. The quality of the clustered data is greatly improved compared to existing algorithms such as one-pass k-means, and the algorithm yields competitive results in terms of information loss.


2.3.3 A clustering-based anonymization approach for privacy-preserving in the healthcare cloud

The core idea of this algorithm is divided into two phases:

● The first phase uses the normal distribution function to delete less frequent data, which are more at risk.

● The second phase uses the k-means++ algorithm to improve data clustering in the k-anonymity method, which raises privacy preservation significantly while reducing information loss and increasing data quality. The k-means++ clustering technique selects similar data and puts them into suitable clusters. After that, the data are anonymized by the k-anonymity method. All of the raw data's properties must be identified, and according to the type of data, the proper method and model for privacy must be selected. Due to their direct connection to privacy, explicit identifiers must be removed before publication. The QIs must be identified and processed in the data processing phase using the normal distribution function.

2.3.4 Mondrian


up the frequencies for each of the unique values in the attribute until the median position is found. The value at which the median is found becomes the split value. The process of the algorithm is illustrated in Figure 2.5.

Figure 2.5: Core process of the Mondrian algorithm

2.3.5 NonHomogenous Anonymization Approach Using Association Rule Mining For Preserving Privacy


Figure 2.6: The Block Diagram Of Nonhomogenous Anonymization Approach Using Association Rule Mining For Preserving Privacy [13]

K-anonymity algorithms like GCCG, Mondrian, and OKA attempt to minimize information loss, but they do so by using generalization or suppression. Generalization reduces the granularity of representation by grouping attribute values into ranges; for example, a date of birth could be generalized to a year of birth. Suppression removes the value of an attribute completely. Both of these methods reduce the risk of identification, but they also reduce the accuracy of data mining applications. These algorithms are too general and do not focus on maintaining data quality for specific data mining techniques. In practice, data is modified to be mined by a specific technique. Therefore, these algorithms are not highly applicable in practice.


mined technique. In [10, 11, 12], the authors proposed and demonstrated this novel idea and a general algorithm based on this approach. This Member Migration technique only replaces the necessary cell values with other values in the current value domain.

For example, Table 2.5 illustrates tuple member migration over the 3 QID attributes {Marital Status, Age, ZIP Code} in order to anonymize Table 2.2 and obtain 3-anonymity: the QID cell values of tuples 4 and 5 are replaced with the values {Marital Status = Separated, Age = 29, ZIP Code = 32042}, and tuples 3 and 6 are replaced with the set of QID values from tuple 2.

Table 2.5: Example of 3-anonymity of dataset D based on tuple member migration

Tuple# | EQ | Marital Status | Age | ZIP Code | Crime
1 | 1 | Separated | 29 | 32042 | Murder
4 | 1 | Separated | 29 | 32042 | Assault
5 | 1 | Separated | 29 | 32042 | Piracy
2 | 2 | Single | 20 | 32021 | Theft
3 | 2 | Single | 20 | 32021 | Traffic
6 | 2 | Single | 20 | 32021 | Indecency
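The member migration shown in Table 2.5 can be expressed as a small operation on tuples: copy the QID values of a target tuple over the QID cells of the migrating tuples, leaving the sensitive attribute untouched. A sketch with the rows of Table 2.2 (the `migrate` helper is an illustrative name, not from the thesis):

```python
from collections import Counter

# Each record: [name, marital_status, age, zip_code, crime]; QIDs are fields 1-3.
table = {
    1: ["Joe", "Separated", 29, "32042", "Murder"],
    2: ["Jill", "Single", 20, "32021", "Theft"],
    3: ["Sue", "Widowed", 24, "32024", "Traffic"],
    4: ["Abe", "Separated", 28, "32046", "Assault"],
    5: ["Bob", "Widowed", 25, "32045", "Piracy"],
    6: ["Amy", "Single", 23, "32027", "Indecency"],
}

def migrate(table, tuple_ids, target_id):
    """Overwrite the QID cells of tuple_ids with those of target_id (SA kept)."""
    qids = table[target_id][1:4]
    for tid in tuple_ids:
        table[tid][1:4] = qids

migrate(table, [4, 5], target_id=1)  # group 1: tuples 1, 4, 5
migrate(table, [3, 6], target_id=2)  # group 2: tuples 2, 3, 6

groups = Counter(tuple(r[1:4]) for r in table.values())
print(groups)  # both QID combinations now have 3 members, i.e. 3-anonymous
```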


Besides, it possesses other advantages: there is no difference between numerical and categorical attributes, and there is no need to build hierarchies of attribute values based on generality. Finally, after anonymization, the receiver will perceive the modified dataset D' as if it had never been modified.

Algorithm M3AR is divided into 3 processing stages.


Figure 2.7: The M3AR algorithm pseudocode

● Second is the Process stage, which is also the core stage: in each while loop, if SelG is null, a group in UG is randomly selected and assigned to SelG, and the most "useful" group g among the remaining groups is found to perform a Member Migration such that, after MM, there is at most one k-unsafe group in the set {SelG, g} (Theorem 1); that k-unsafe group is assigned to SelG and processed in the next loop. Otherwise, if there exists no k-unsafe group in {SelG, g}, then SelG = null, and a new k-unsafe group in UG is selected randomly and processed in the next loop.

Figure 2.8: Disperse function of M3AR algorithm

● Finally is the Disperse stage: when the while loop ends, the groups in UM are dispersed by the Disperse function. The time complexity of this stage is minor compared with the while loop, because the number of groups processed in UM is much smaller than in the while loop; normally |UM| = 0. In line 6 of the Disperse function, the most "useful" group is selected by the following rule, in descending priority order of two factors: fewer rules r with budget_r < 0 (if t migrates to g), and lower cost.


CHAPTER 3: THE PROPOSED ALGORITHM

The proposed algorithm aims to anonymize data using k-anonymity while also preserving the significant association rules in the database. To do this, the algorithm must find suitable changes to the data that will not cause the significant association rules to be lost. The related solution, M3AR, uses tuple member migration to achieve data anonymization. However, M3AR has both strong and weak points. The proposed algorithm extends M3AR to address these weaknesses and improve the overall performance of the algorithm.

3.1 Definitions

The algorithm uses the tuple Member Migration (MM) technique of the M3AR algorithm discussed in Section 2. Before presenting the proposed algorithm, we need to go through some definitions:

Definition: A group is a set of tuples in which all tuples must have the same QI values. A group satisfies k-anonymity if it has at least k tuples or has no tuples in it; otherwise, we call it an unsafe group.

Risk: A group with m tuples (m > 0) is assigned an estimated risk through the function:

(3.1)

Definition: Let mgrt(g_i, g_j) denote the possible migrant directions between two groups g_i and g_j. There exist two separate migrant directions: from g_i to g_j, denoted mgrt(g_i → g_j), and from g_j to g_i, denoted mgrt(g_i ← g_j).


Definition: A Member Migration operation mgrt(g_i → g_j) is "valuable" when the risk of the data is decreased after performing that Member Migration operation.

3.2 Budgets Calculation

The main objective of the algorithm is to try to retain the association rules while guaranteeing k-anonymity. However, it is difficult to retain all association rules because the number of association rules may be very large. Normally, the data mining process considers association rules that occur frequently in the database; therefore, the algorithm should try to retain these rules. We call these rules the significant rules. In the algorithm, two threshold values are provided to specify whether an association rule is significant: min_sup (s_m) and min_conf (c_m). An association rule is significant if its support value is greater than min_sup and its confidence value is also greater than min_conf; otherwise, the association rule is insignificant.

The goal of the proposed algorithm is to turn unsafe groups into safe ones by finding the changes that achieve this transformation. The algorithm also preserves the significant association rules of the database, so it has to find appropriate changes that do not affect these rules. In this section, we estimate some "budgets" that the algorithm will use to find these changes.

Given a dataset D, a set of tuples with an attribute set I = {i_1, i_2, ..., i_m}, an association rule from D has the form A → B (A ⊂ I, B ⊂ I, A ∩ B = ∅).

The support of rule A → B:

s = frq(A, B) / total   (3.2)

is the percentage of tuples containing both A and B in D. The confidence of rule A → B:

c = frq(A, B) / frq(A)   (3.3)

is the percentage of tuples that contain both A and B among the tuples containing A. min_sup (s_m) and min_conf (c_m) are the minimum support and minimum confidence [15], respectively, which are two input thresholds. A rule A → B is a significant rule when Support(A → B) = s ≥ s_m and Confidence(A → B) = c ≥ c_m.

Suppose we have a significant association rule A → B that we want to keep; this rule has support and confidence values above the thresholds. When we apply changes, they may modify some QI attribute values of tuples that support this rule; as a result, such a tuple may stop supporting the rule. Obviously, if we change too many tuples, the rule A → B may lose its significance. Therefore, for each significant rule, we need to calculate the maximum number of tuples that we can change while keeping the rule significant.

In addition, when we apply changes, an insignificant rule may become significant. As discussed earlier, the algorithm also ensures that no new significant rules are created, because they may alter the outcome of the data mining process. Therefore, we also calculate the maximum number of tuples that we can change without creating new significant rules. The algorithm uses these maximum numbers to calculate the cost of each change; the cost of a change is explained in later sections. Based on the costs, the algorithm finds the best changes that turn an unsafe group into a safe one. In the following parts, we calculate the budgets, i.e., the maximum number of tuples that we can change while keeping a rule significant:

Given a significant association rule A → B, let s be the support value and c the confidence value of this rule, with s ≥ s_m and c ≥ c_m. When we perform a Member Migration operation, three cases arise:


● Case 1: A will be changed. Let n be the number of tuples that migrate, and let s' and c' be the support and confidence values of the rule after performing the change. To keep the rule significant, we must have s' ≥ s_m and c' ≥ c_m, where:

c' = (frq(A, B) − n) / (frq(A) − n)   (3.4)

s' = (frq(A, B) − n) / total   (3.5)

Here frq(A, B) is the number of tuples that contain both A and B, frq(A) is the number of tuples that contain A, and total is the number of tuples in the database. Besides, we also have:

s = frq(A, B) / total   (3.7)

c = frq(A, B) / frq(A)   (3.8)

The maximal number of tuples that can be altered is:

n = MIN[ total · (s − s_m), total · s · (c − c_m) / (c · (1 − c_m)) ]   (3.9)

● Case 2: B will be changed. Similarly, we have:

c' = (frq(A, B) − n) / frq(A)   (3.10)

s' = (frq(A, B) − n) / total   (3.11)

Moreover, we also have:

s = frq(A, B) / total   (3.12)

c = frq(A, B) / frq(A)   (3.13)

The conditions are s' ≥ s_m and c' ≥ c_m. Therefore, we have:

n = MIN[ total · (s − s_m), total · s · (c − c_m) / c ]   (3.14)

● Case 3: Both A and B will be changed. We notice that this case is similar to Case 1. Therefore, we have:
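The budgets of Equations (3.9) and (3.14) can be computed directly from a rule's statistics. The sketch below reuses rule r1 from Table 2.4 (frq(A, B) = 3, frq(A) = 4, total = 10) with s_m = c_m = 0.3; the third call is a hypothetical rule with some slack above the thresholds:

```python
def budget_case1(total, frq_ab, frq_a, s_m, c_m):
    """Equation (3.9): max tuples changeable on side A, keeping the rule significant."""
    s, c = frq_ab / total, frq_ab / frq_a
    return min(total * (s - s_m), total * s * (c - c_m) / (c * (1 - c_m)))

def budget_case2(total, frq_ab, frq_a, s_m, c_m):
    """Equation (3.14): max tuples changeable on side B."""
    s, c = frq_ab / total, frq_ab / frq_a
    return min(total * (s - s_m), total * s * (c - c_m) / c)

# Rule r1 from Table 2.4: frq(A,B) = 3, frq(A) = 4, total = 10.
n1 = budget_case1(10, 3, 4, s_m=0.3, c_m=0.3)
n2 = budget_case2(10, 3, 4, s_m=0.3, c_m=0.3)
print(n1, n2)           # 0.0 0.0

# Hypothetical rule with frq(A,B) = 4, frq(A) = 5: s = 0.4, c = 0.8.
n3 = budget_case1(10, 4, 5, s_m=0.3, c_m=0.3)
print(round(n3, 6))     # 1.0
```

Since r1's support sits exactly on min_sup, its budget is zero: migrating even one supporting tuple would make the rule insignificant.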
