
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

LẠI TRUNG MINH ĐỨC

PRESERVING PRIVACY FOR PUBLISHING TIME-SERIES DATA

WITH DIFFERENTIAL PRIVACY

Major: COMPUTER SCIENCE Major code: 8480101

MASTER’S THESIS


THIS THESIS IS COMPLETED AT

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY – VNU-HCM

Supervisor 1: Assoc Professor DANG TRAN KHANH

Supervisor 2: PhD LE LAM SON

Examiner 1: Assoc Professor TRAN TUAN DANG

Examiner 2: PhD DANG TRAN TRI

This master's thesis was defended at Ho Chi Minh City University of Technology, VNU-HCM, on 10 July 2023.

Master’s Thesis Committee:

1. Chairman - Assoc Professor TRAN MINH QUANG
2. Secretary - PhD NGUYEN THI AI THAO
3. Examiner 1 - Assoc Professor TRAN TUAN DANG
4. Examiner 2 - PhD DANG TRAN TRI
5. Commissioner - PhD TRUONG THI AI MINH

Approval of the Chairman of the Master's Thesis Committee and the Dean of the Faculty of Computer Science and Engineering after the thesis has been corrected (if any).

CHAIRMAN OF THESIS COMMITTEE

HEAD OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING


VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

SOCIALIST REPUBLIC OF VIETNAM Independence – Freedom - Happiness

THE TASK SHEET OF MASTER’S THESIS

Full name: LẠI TRUNG MINH ĐỨC
Student ID: 2070686
Date of birth: 24 May 1996
Place of birth: Ho Chi Minh City
Major: Computer Science
Major ID: 8480101

I THESIS TITLE:

Preserving Privacy for Publishing Time-series Data with Differential Privacy (Duy trì quyền riêng tư cho việc xuất bản dữ liệu chuỗi thời gian với quyền riêng tư khác biệt)

II TASKS AND CONTENTS:

Week W1 to W2 (2 weeks):
- Conduct the literature review and define the methodology for the study
- Define the scope of work for the main research of the thesis
- Write up the report

Week W3 to W4 (2 weeks):
- Research related works/projects on Differential Privacy and time-series privacy
- Write up the report (cont.)

Week W5 to W14 (10 weeks):
- Implement the algorithms of Differential Privacy on time-series data
- Compare those algorithms using data utility metrics
- Find the data characteristics to choose the best algorithms
- Write up the report (cont.)

Week W15 to W16 (2 weeks):
- Finalize the solution package
- Finalize the document
- Prepare the presentation


III THESIS START DAY:

(According to the decision on assignment of Master’s thesis)

05 September 2022

IV THESIS COMPLETION DAY:

(According to the decision on the assignment of the Master’s thesis)

12 June 2023

V SUPERVISOR (Please fill in the supervisor’s full name and academic rank)

Assoc Prof DANG TRAN KHANH – PGS TS ĐẶNG TRẦN KHÁNH

PhD LE LAM SON – TS LÊ LAM SƠN

Ho Chi Minh City, 09 June 2023

SUPERVISOR 1 (Full name and signature)

SUPERVISOR 2 (Full name and signature)

Assoc Prof DANG TRAN KHANH PhD LE LAM SON

CHAIR OF PROGRAM COMMITTEE (Full name and signature)

DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING (Full name and signature)

Note: Student must pin this task sheet as the first page of the Master’s Thesis


ACKNOWLEDGEMENT

I would like to express my profound gratitude to all those who have supported me throughout the journey of completing this master's thesis.

First and foremost, I extend my heartfelt thanks to my supervisors, Assoc Prof DANG TRAN KHANH and PhD LE LAM SON, for your invaluable guidance, patience, and expertise. Working under your supervision has been an honor, and your insightful feedback and encouragement have been instrumental in shaping the outcome of this research.

I would also like to extend my sincere appreciation to the team at Unilever Vietnam, particularly Mr ERIC FRANCIS CHEN – Head of UniOps and Data & Analytics SEA&I, Mr IAN LOW - my line manager, and the awesome Unilever Data & Analytics Vietnam team. Your willingness to assist and your insightful discussions have significantly contributed to the successful completion of this thesis.

Furthermore, I want to acknowledge my friends, classmates, doctors, and psychologists for your mental support and contributions. Your motivation, discussions, and treatment with diverse perspectives have supported me to live positively. Special appreciation goes to Ms HOA NINH, Ms LINH PHAM, Ms XUAN NGUYEN, and Mr RYAN NGUYEN for your huge support and encouragement throughout this journey.

Lastly, I am indebted to my family: my mom THAI LINH PHAM and my sister MARY PHUC LAI, for your unconditional love and constant encouragement. Your sacrifices and understanding have been the bedrock of my achievements. I am profoundly grateful for your presence and support.

To all those mentioned above, as well as those who have contributed in immeasurable ways, I offer my sincerest thanks Your efforts, belief, and contributions have made


ABSTRACT

This thesis explores the crucial domain of data privacy, encompassing the rights of individuals to maintain control over the collection, usage, sharing, and storage of their personal data. Within the realm of personal data, time-series data poses distinct challenges and sensitivities when it comes to privacy protection. Time-series data comprises information with temporal attributes that can unveil patterns, trends, and behaviors of individuals or groups over time, and carries inherent risks in terms of privacy breaches.

The primary objectives of this thesis are as follows: first, to review traditional methods of privacy-preserving data publishing, with a specific focus on efforts made for protecting time-series data; second, to gain a comprehensive understanding of the theories and principles of Differential Privacy, a promising approach for privacy preservation; third, to explore notable mechanisms within Differential Privacy that are applicable to time-series data; fourth, to investigate and address privacy challenges in data partnerships through the integration of Differential Privacy and other relevant techniques; and finally, to develop a process for the application of privacy techniques within the context of business collaborations.


TÓM TẮT LUẬN VĂN

Luận văn này nghiên cứu về lĩnh vực quan trọng của quyền riêng tư dữ liệu, bao gồm quyền của cá nhân để duy trì sự kiểm soát về việc thu thập, sử dụng, chia sẻ và lưu trữ dữ liệu cá nhân của họ. Trong lĩnh vực dữ liệu cá nhân, dữ liệu chuỗi thời gian đặt ra những thách thức và nhạy cảm riêng biệt khi đến việc bảo vệ quyền riêng tư. Dữ liệu chuỗi thời gian bao gồm thông tin có thuộc tính thời gian có thể tiết lộ các mẫu, xu hướng và hành vi của cá nhân hoặc nhóm qua thời gian, và mang theo các rủi ro về việc vi phạm quyền riêng tư.

Các mục tiêu chính của luận văn này như sau: thứ nhất, xem xét các phương pháp truyền thống về việc xuất bản dữ liệu bảo vệ quyền riêng tư, với tập trung đặc biệt vào các nỗ lực để bảo vệ dữ liệu chuỗi thời gian; thứ hai, để hiểu rõ về các lý thuyết và nguyên tắc của Sự khác biệt về Quyền riêng tư, một phương pháp hứa hẹn để bảo vệ quyền riêng tư; thứ ba, khám phá các cơ chế đáng chú ý trong Sự khác biệt về Quyền riêng tư mà có thể áp dụng cho dữ liệu chuỗi thời gian; thứ tư, điều tra và đối mặt với những thách thức về quyền riêng tư trong các đối tác dữ liệu thông qua việc tích hợp Sự khác biệt về Quyền riêng tư và các kỹ thuật liên quan khác; và cuối cùng, phát triển quy trình để áp dụng các kỹ thuật bảo vệ quyền riêng tư trong bối cảnh hợp tác kinh doanh.


THE COMMITMENT OF THESIS’ AUTHOR

I hereby declare that this master's thesis is my own original work and has not been submitted to any institution for assessment purposes before.

Further, I have acknowledged all sources used and have cited them in the reference section.

…………………………… ………………………


TABLE OF CONTENTS

CHAPTER 1: OVERVIEW OF THE THESIS 1

1 Background and Context 1

2 Data Publishing and Privacy Preserving Data Publishing 2

3 Challenges of Privacy Preserving Data Publishing (PPDP) for Time-series data 3

4 Differential Privacy as a powerful player 4

5 Thesis objectives 5

6 Thesis contributions 5

7 Thesis structure 5

CHAPTER 2: PRIVACY MODELS RESEARCH 7

1 Attack models and notable privacy models 7

1.1 Record linkage attack and k-Anonymity privacy model 7

1.2 Attribute linkage attack and l-diversity and t-closeness privacy model 8

1.3 Table linkage and δ-presence privacy model 10

1.4 Probabilistic linkage and Differential Privacy model 11

2 Summary 12

CHAPTER 3: THE INVESTIGATION ON DIFFERENTIAL PRIVACY 14

1 The need for Differential Privacy principle 14

1.1 No need to model the attack model in detail 14

1.2 Quantifiable privacy loss 15

1.3 Multiple mechanisms composition 16


2.1 The promise 17

2.2 The not promise 18

2.3 Conclusion 18

3 Formal definition of Differential Privacy 18

3.1 Terms and notations 19

3.2 Randomized algorithm 20

3.3 𝜀-differential privacy 20

3.4 (𝜀, 𝛿) differential privacy 21

4 Important concepts of Differential Privacy 22

4.1 The sensitivity 23

4.2 Privacy composition 24

4.3 Post processing 26

5 Foundation mechanisms of Differential Privacy 26

5.1 Local Differential Privacy and Global Differential Privacy 26

5.2 Laplace mechanism 27

5.3 Exponential mechanism 29

6 Notable mechanisms for Time-series data 30

6.1 Laplace mechanism (LPA – Laplace Perturbation Algorithm) 30

6.2 Discrete Fourier Transform (DFT) with Laplace mechanism (FPA – Fourier Perturbation Algorithm) 31

6.3 Temporal perturbation mechanism 32

6.4 STL-DP – Perturbed time-series by applying DFT with Laplace mechanism on trends and seasonality 37


CHAPTER 4: EXPERIMENT DESIGNS 38

1 Experiment designs 38

1.1 Case description 38

1.2 Data structure aligns with data provider 39

1.3 System alignments 40

1.4 Concerns and constraints 41

2 Problem analysis 41

2.1 Revisit the GDPR related terms for data sharing 41

2.2 Potential attack models and countermeasures 44

2.3 Define scope of work 45

3 Evaluation methodology 46

3.1 Data utility 46

3.2 Privacy metrics 46

3.3 Evaluation process 47

4 Privacy protection proposal 47

CHAPTER 5: EXPERIMENT IMPLEMENTATIONS 49

1 Experiment preparation 49

2 Data exploration analysis (EDA) 49

2.1 Data overview 49

2.2 Descriptive Analysis 50

2.3 Maximum data domain estimation 52

3 Differential Privacy mechanisms implementation 54

4 Data perturbed evaluation 61


4.2 Forecasting trendline at categories, consumer-groups, and store level 65

4.3 Privacy evaluation 73

4.4 Recommendation for using Differential Privacy in data partnership use-cases 77

CHAPTER 6: CONCLUSION AND FUTURE WORKS 78

REFERENCES 80

APPENDIX 83

Table: Descriptive Statistics table of the data perturbation output 83

Table: Accuracy result of RFM analysis between data perturbations and original data 85

Table: Accuracy (RMSE) result from the forecast of data perturbations and original data 88


TABLE OF TABLES

Table 1 Original Shopping data table structure 39

Table 2 Shopping Data table structure for the use-case 40

Table 3 Data structure of the synthesis dataset 50

Table 4 Maximum domain value estimation for each Category 53

Table 5 Descriptive Statistics table of the data perturbation output (detail table in the Appendix) 60

Table 6 Accuracy result of RFM analysis between data perturbations and original data (detail in Appendix) 64

Table 7 Accuracy (RMSE) result from the forecast of data perturbations and original data (detail in Appendix) 70

Table 8 Accuracy (RMSE) result of the adjusted forecast version (detail in Appendix) 72


TABLE OF FIGURES

Figure 1 A fictitious database with hospital data, taken from [14] 8

Figure 2 Visualize how Sequential and Parallel Composition works – taken from [25] 25

Figure 3 Visualize how Local Privacy and Global Privacy works - taken from [25] 27

Figure 4 The Laplace mechanism with multiple scales 28

Figure 5 The visualization process of LPA and DFT (or FPA) - taken from [19] 32

Figure 6 The visualization of Backward Perturbation mechanism (BPA) - taken from [4] 34

Figure 7 The visualization of Forward Perturbation mechanism - taken from [4] 35

Figure 8 The cost analysis table for BPA and FPA method - taken from [4] 36

Figure 9 The pseudo-code for Threshold Mechanism - taken from [4] 36

Figure 10 The process of how STL-DP mechanism works - taken from [27] 37

Figure 11 EDA - Fetch the synthesis dataset and display on Python 49

Figure 12 EDA - Code block and display for dataset descriptive statistics 50

Figure 13 EDA - Code block for dataset aggregation to find TOP 5 huge value consumers 51

Figure 14 EDA - Code block to visualize the purchase pattern over year by week of the dataset (aggregation) 52

Figure 15 Box-plot to visualize the quantity range - support for estimate the maximum data domain 54

Figure 16 Code-block - Import related libraries and initialize Ray multiprocessing 55

Figure 17 Code-block - Implementation of Laplace Perturbation Mechanism on time-series data 56

Figure 18 Code-block - Implementation of Fourier Perturbation Mechanism on time-series data 56


Figure 20 Code-block - Setup the Epsilon and Maximum data domain values 58

Figure 21 Code-block function to run the data perturbation process 59

Figure 22 Code-block for multi-processing run of the whole process 60

Figure 23 Code-block to implement RFM Analysis 63

Figure 24 Code-block to measure accuracy of RFM clusters for data perturbation methods and the clusters of original data 64

Figure 25 EDA - Code-block to calculate number of timeseries/groups to conduct the forecast in next section 66

Figure 26 EDA - Code-block to calculate number of timeseries/groups to conduct the forecast in next section - Select only value > 300 days/year 67

Figure 27 Code-block for building data lookup dictionary for RFM bins 67

Figure 28 Code-block for running Simple Linear Regression to forecast, combine with the parameter M lookup table 68

Figure 29 Code-block for RMSE calculation for the forecast output 69

Figure 30 Adjusted version code-block to run Simple Linear Regression on the data perturbations and original data 71

Figure 31 Data Utility and Privacy trade-off (RMSE normal scale) 73

Figure 32 Data Utility and Privacy trade-off (RMSE Log10 scale) 74

Figure 33 EDA - Check the TOP 5 high value consumers in multiple methods (Original data, FPA, tFPA) 75

Figure 34 EDA - Plot Line chart to compare 3 methods and original volume of USER 944 76


CHAPTER 1: OVERVIEW OF THE THESIS

1 Background and Context

Data privacy, also referred to as information privacy or data protection, encompasses the fundamental right of individuals to exert control over the collection, usage, sharing, and storage of their personal data by others. Personal data encompasses any information that identifies or relates to an individual, including but not limited to their name, address, phone number, email, health records, financial transactions, online activities, preferences, and opinions.

Within the realm of personal data, time-series data poses unique challenges and sensitivities in terms of privacy protection. Time-series data contains temporal information that can unveil patterns, trends, and behaviors of individuals or groups over time. Examples of personal time-series data include weblogs, social media posts, GPS traces, and health records, among others. While time-series data serves various purposes, such as forecasting, anomaly detection, classification, clustering, and summarization, it also carries inherent risks if privacy breaches occur.

For instance, an attacker could potentially deduce the identity of a user by matching their GPS traces or weblogs with publicly available information. Similarly, the analysis of temporal patterns or correlations within sensor readings or social media posts could lead to the identification of a user. Moreover, the observation of changes or trends over time in health records can reveal an individual's preferences, habits, and activities. These scenarios underscore the significance of safeguarding privacy when dealing with time-series data to mitigate the potential risks associated with unauthorized access or misuse of personal information.

Even when data holders apply anonymization techniques before publishing, there is still a significant risk of re-identifying individuals. For instance:

• In 2006, AOL released a dataset of 20 million search queries from 650,000 users over a three-month period [1]. The dataset was intended for research purposes and was anonymized by replacing user IDs with random numbers. However, researchers and journalists were able to re-identify some users by analyzing their search queries and linking them with other public information, such as news articles, blogs, or social media profiles. Some users were exposed to embarrassment, harassment, or legal trouble as a result of their search histories being revealed.

• In 2007, Netflix released a dataset of 100 million movie ratings from 480,000 users (about half the population of Montana) over a six-year period. The dataset was intended for a contest to improve Netflix's recommendation system and was anonymized by removing usernames and other identifying information. However, researchers [2] were able to re-identify some users by comparing their movie ratings with those on another movie website, IMDb. Some users were exposed to privacy breaches as their movie preferences could reveal their political views, sexual orientation, or health conditions.

2 Data Publishing and Privacy Preserving Data Publishing

Data publishing is the process of making data available to the public or other parties for analysis, research, or decision-making. However, data often contains sensitive information about individuals, such as their identities, preferences, health conditions, or financial records. If such data is published without proper protection, it may lead to privacy breaches and harm the individuals' interests or rights.


There are different types of data that can be published, such as tabular data (e.g., census records, medical records) or graph data (e.g., social networks, web graphs). Depending on the data type and the privacy requirements, different PPDP techniques can be used. Some of the common PPDP techniques are (adapted from [7] and [16]):

• Generalization: This technique replaces specific values in the data with more general ones, such as replacing ages with age ranges or zip codes with regions. Generalization reduces the granularity of the data and makes it harder to identify individuals based on their attributes (see the sketch after this list).

• Suppression: This technique removes some values or records from the data entirely, such as deleting names or phone numbers or omitting outliers. Suppression reduces the amount of information in the data and makes it harder to link individuals across different sources.

• Perturbation: This technique adds noise or randomness to the data, such as swapping values among records, adding or subtracting a small amount from numerical values, or flipping bits in binary values. Perturbation alters the original values in the data and makes it harder to infer the true values based on statistical analysis.

• Encryption: This technique transforms the data using cryptographic methods, such as hashing, encryption, or homomorphic encryption. Encryption protects the confidentiality of the data and makes it harder to access the original values without a secret key.
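To make the first two techniques more concrete, the following is a minimal sketch (not taken from the thesis; the table, column names, and cut-points are hypothetical) of generalization and suppression applied to a toy table with pandas.

```python
# Hypothetical toy table; generalization and suppression only, for illustration.
import pandas as pd

df = pd.DataFrame({
    "name": ["An", "Binh", "Chi"],
    "age": [23, 37, 45],
    "zip": ["70010", "70011", "70250"],
    "diagnosis": ["flu", "cancer", "flu"],
})

# Generalization: replace exact ages with coarse ranges and keep only a ZIP prefix.
df["age"] = pd.cut(df["age"], bins=[0, 30, 40, 50], labels=["0-30", "31-40", "41-50"])
df["zip"] = df["zip"].str[:3] + "**"

# Suppression: remove the direct identifier column entirely.
df = df.drop(columns=["name"])

print(df)
```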

3 Challenges of Privacy Preserving Data Publishing (PPDP) for Time-series data


• Correlation: Time-series data often exhibits strong temporal or spatial correlation, which means that some attributes' values depend on the values of other attributes or previous time points. This can lead to privacy breaches if an adversary can exploit this correlation to infer sensitive information from the published data.

• Dynamics: Time-series data is often updated or streamed continuously, which requires efficient and adaptive PPDP methods that can handle dynamic changes and updates without compromising privacy or utility.

4 Differential Privacy as a powerful player

To address the challenges above, there is a principle that is widely accepted and explored by researchers today, called Differential Privacy, introduced in 2006 by Professor Cynthia Dwork (adapted from [7], [11], [16], [21], [22]).

Differential privacy is a commitment made by a data holder or curator to a data subject: no matter what other research, data sets, or information sources are available, enabling the use of your data in any study or analysis will have no effect on you, whether positive or negative. At their finest, differentially private database systems may make secret data publicly accessible for effective data analysis without the need for clean data rooms, data usage agreements, or limited views.

Differential privacy guarantees that the same conclusions, such as "smoking causes cancer", will be reached regardless of whether a person opts in or out of the data collection. In particular, it assures that each output sequence (responses to queries) is "basically" equally likely to occur, regardless of the presence or absence of any individual. Here, probabilities are taken over random selections made by the privacy mechanism (something controlled by the data curator), and the phrase "basically" is captured by the privacy parameter ε.


Differential Privacy has been applied to time-series data in various scenarios, such as smart metering, location-based services, or health monitoring.

In this thesis, we will explore Differential Privacy further, from its theory to some proposed solutions for protecting the privacy of time-series data.

5 Thesis objectives

• To review the traditional Privacy Preserving Data Publishing methods and the efforts made for time-series data.

• To understand the theories and principles of Differential Privacy.

• To explore notable mechanisms in Differential Privacy for time-series data.

• To explore and solve the privacy use-case in the data partnership with Differential Privacy and other techniques, and then build the process to apply privacy techniques to business collaboration.

6 Thesis contributions

• An effort to make differential privacy easier to understand, especially for non-academic and corporate audiences.

• A suggested guideline to apply and evaluate differential privacy on time-series data in the use-case of data collaboration between multiple parties.

7 Thesis structure

Chapter 1: Background and context (this chapter) – summarizes the purpose of data publishing, the purpose of privacy protection for released data, the challenges of time-series data, and gives a brief overview of the Differential Privacy principle.

Chapter 2: Literature review – summarizes and compares privacy attack models with notable algorithms, briefly analyzes their weaknesses and their connection to Differential Privacy.


Chapter 4: Experiment designs – based on a synthetic use-case from a real-world problem, analyzes and proposes a process and solution to protect the privacy of individuals in a data partnership process.

Chapter 5: Experiment implementations – based on the proposal in Chapter 4, utilizes knowledge of Differential Privacy techniques and Data Analytics to implement the related requirements.


CHAPTER 2: PRIVACY MODELS RESEARCH

1 Attack models and notable privacy models

1.1 Record linkage attack and k-Anonymity privacy model

Record linkage attacks (adapted from [7], [9], [16]) are privacy attacks that exploit the linkage of records across different data sources to reveal sensitive information about individuals. These attacks typically involve identifying records in different datasets that correspond to the same individual and then linking the records together to obtain additional information. Usually, the linkage attack is based on quasi-identifiers (QIDs), such as age, gender, or ZIP code, which are used to link records that correspond to the same individual across different datasets.

From the research of Sweeney [9], based on just ZIP code, date of birth, and gender, there is an 87% chance of uniquely identifying individuals in the US, which is a very high probability.

To counter this attack, the k-anonymity method was proposed by Sweeney [9]: if one record in the table has some QID value, then at least k-1 other records also have that QID value, which means each record is indistinguishable from at least k-1 other records with respect to the QID. Therefore, the probability of linking a victim to a specific record through the QID is at most 1/k.

k-anonymity uses data generalization and suppression techniques to ensure that every combination of values for the quasi-identifiers can be indistinguishably matched to at least k individuals in one equivalence class.
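As a small illustration of this definition, the sketch below (hypothetical data and column names, not part of the thesis) checks whether a table satisfies k-anonymity by counting the size of every equivalence class induced by the quasi-identifiers.

```python
import pandas as pd

def satisfies_k_anonymity(df: pd.DataFrame, qids: list, k: int) -> bool:
    """True if every combination of QID values appears in at least k records."""
    class_sizes = df.groupby(qids).size()
    return bool((class_sizes >= k).all())

table = pd.DataFrame({
    "age_range": ["20-30", "20-30", "20-30", "31-40"],
    "zip":       ["702**", "702**", "702**", "700**"],
    "disease":   ["flu", "cancer", "flu", "flu"],
})

# False: the (31-40, 700**) equivalence class contains only one record.
print(satisfies_k_anonymity(table, ["age_range", "zip"], k=2))
```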


Figure 1 A fictitious database with hospital data, taken from [14]

1.2 Attribute linkage attack and l-diversity and t-closeness privacy model

The attribute linkage attack (adapted from [7], [14], [16]) is a method that enables an attacker to infer sensitive values of a target victim from the published data, using the set of sensitive values associated with the group that the victim belongs to, even when the attacker cannot accurately identify the victim's record. K-anonymized databases suffer from attribute disclosure through two attack forms: the homogeneity attack and the background knowledge attack. The homogeneity attack is possible when the sensitive attributes lack diversity, while the background knowledge attack relies on having partial knowledge of an individual or a distribution of sensitive and non-sensitive attributes in a population.


However, a drawback of l-diversity is that it can be too difficult or unnecessary to achieve. Imagine a scenario with 10,000 data records with a sensitive attribute. Let the class be A for 99% of the individuals and B for the remaining 1%. The two possible values have different sensitivities: one would not mind being tested negative, thus, in an equivalence class that contains only A-entries for the sensitive attribute, l-diversity is unnecessary. But to achieve 2-diversity, there could only exist 10,000 × 0.01 = 100 equivalence classes, which would lead to a large information loss.

Both l-diversity and k-anonymity are insufficient to prevent attribute disclosure, as there are two conceivable attacks: the skewness attack and the similarity attack. If the overall distribution of sensitive attributes is skewed, as in the example of 99% to 1%, a skewness attack is conceivable. Imagine that one equivalence class contained the same amount of positive and negative records. The database would still be 2-diverse, but the risk of anyone in that class being considered positive would increase from 1% to 50%, which could compromise the privacy of individuals. Similarity attacks are possible if the sensitive attribute values within an equivalence class are distinct but semantically similar. Consider once more the hospital example and an equivalence class with the sensitive attribute values pulmonary cancer, breast cancer, and brain cancer. These are distinct values, but they all represent a form of cancer; therefore, a person in this equivalence class has cancer.

To overcome the similarity attacks and skewness attacks of l-diversity, Li et al. [15] introduced t-closeness, which is a privacy model more stringent than l-diversity. It requires that the distribution of the sensitive attribute values in each equivalence class be t-close to the distribution of the sensitive attribute values in the entire dataset. This ensures that an attacker cannot infer the value of a sensitive attribute for an individual record by simply looking at the distribution of the sensitive attribute values in the equivalence class that the record belongs to.


the sensitive attribute values in the entire dataset. This makes it more difficult for an attacker to infer the value of a sensitive attribute for an individual record.

Although t-closeness is better than l-diversity in terms of protection, it has significant drawbacks due to its expensive computation and a large loss in data utility. Therefore, in the era of big data, l-diversity and t-closeness are rarely mentioned as best practices.

1.3 Table linkage and δ-presence privacy model

The concepts of record linkage and attribute linkage typically assume that the attacker is already aware of the presence of the victim's record in the published table. However, this may not always be the case. In some situations, the mere existence or absence of the victim's record in the dataset can reveal sensitive information. For instance, consider a scenario where a medical facility releases a data table concerning a specific illness. The mere presence of the victim's record in the database can have detrimental consequences. Table linkage occurs when an attacker can reliably deduce either the presence or absence of the victim's record in the published table.

In 2007, Nergiz (adapted from [7], [16]) proposed a solution to the table linkage problem called the delta-presence (δ-presence) approach. The fundamental idea is to constrain or "bound" the probability of inferring the presence of each potential victim's data within a designated range referred to as δ. The δ-presence approach offers the advantage of indirectly preventing record and attribute linkages. If an attacker has a confidence level of at most δ% regarding the presence of the victim's record in the leaked table, then the likelihood of successfully linking to their record and sensitive attributes is limited to at most δ percent. Therefore, δ-presence can indirectly mitigate the risks associated with record and attribute connections.


1.4 Probabilistic linkage and Differential Privacy model

There exists a distinct category of privacy models that diverge from the precise identification of records, attributes, and tables that an attacker can link to a specific target individual. Instead, these models focus on the alteration of an attacker's probabilistic beliefs concerning the sensitive information of a victim following access to published data. The overarching objective of this family of privacy models is to uphold the uninformative principle, which strives to minimize the discrepancy between prior and posterior beliefs.

In 2006, Cynthia Dwork (adapted from [7], [12], [16], [22]) introduced a compelling concept known as Differential Privacy, emphasizing the protection of individuals' privacy in statistical databases. Dwork posited that the participation of record owners in such databases should not significantly heighten the risk to their privacy. Rather than comparing the prior and posterior probabilities before and after accessing the published data, Dwork proposed comparing the risk with and without the inclusion of the record owner's data in the public dataset. Consequently, Dwork formulated a privacy model termed differential privacy, aimed at ensuring that the addition or removal of a single database record has an inconsequential impact on any analysis outcomes. This implies that the merging of separate databases does not expose the organization to additional risks.


This robust guarantee is achieved by comparing the published data with and without the contribution of the record owner's data.

2 Summary

The fundamental principles of data privacy protection and their widespread adoption can be acknowledged, although it is widely recognized that no single solution can fully guarantee privacy. The concept of k-Anonymity was developed to safeguard individuals against record linkage. However, in cases where k-Anonymity proves inadequate in preventing attribute linkage, the notion of l-Diversity was introduced, subsequently leading to the development of a more stringent variant known as t-Closeness. To address the risks associated with the presence of an individual's information in a dataset, the concept of δ-presence was formulated. Furthermore, Cynthia Dwork approached privacy concerns from both a prior and a posterior perspective by devising Differential Privacy, a mathematical framework that incorporates measurable noise to model the privacy problem.

Based on extensive research ([11], [21], [25]), Differential Privacy offers remarkable advantages, as summarized below:

• Protection against various attack models: Unlike other privacy mechanisms, Differential Privacy does not require detailed modeling of specific attack models. Regardless of the attacker's goals or background knowledge, the mechanisms proven to satisfy Differential Privacy can still effectively protect privacy.

• Measurable privacy loss: While the addition of noise to data for privacy protection has been practiced since the 1970s, using techniques such as Statistical Disclosure Control, Differential Privacy provides a means to quantify the amount of noise necessary for a privacy solution while balancing it with data utility.


• Composition of multiple mechanisms: Differential Privacy makes it possible to combine multiple privacy algorithms, resulting in a more robust and reasonable solution.


CHAPTER 3: THE INVESTIGATION ON DIFFERENTIAL PRIVACY

1 The need for Differential Privacy principle

As mentioned in the previous chapters, Differential Privacy could be a key solution to the limitations of k-Anonymity, l-Diversity, and their variants when dealing with multiple attack types, and could improve performance because of its low computational complexity. We can explore some further good reasons to consider using DP here (adapted from [11], [21], [25]).

1.1 No need to model the attack model in detail

All previous privacy models required certain assumptions about the capabilities and objectives of the attacker. Selecting the appropriate notion of privacy necessitated understanding the extent of the attacker's prior knowledge, the auxiliary data they could utilize, and the specific information they aimed to uncover.

Implementing these definitions in practice proved challenging and prone to errors. Addressing these questions was a delicate task, as the attacker's intentions and capabilities might not be fully known or anticipated. Furthermore, there were potential unknown unknowns, representing unforeseen attack vectors that could undermine privacy. Consequently, making definitive statements based on these traditional definitions was difficult, as they relied on assumptions that could not be guaranteed with absolute certainty.

In contrast, the adoption of differential privacy provides two remarkable guarantees:

• Protection of all forms of individual information: Regardless of the attacker's objectives, whether re-identifying a target, determining their presence in the dataset, or inferring sensitive attributes, all such endeavors are safeguarded. Consequently, there is no need to explicitly consider the attacker's goals when applying differential privacy.


• Protection regardless of the attacker's background knowledge: Privacy guarantees extend to individuals whom the attacker is unfamiliar with, ensuring their protection regardless of the attacker's level of knowledge.

By embracing differential privacy, organizations can confidently safeguard personal information across various scenarios, irrespective of the attacker's goals or knowledge about the data.

1.2 Quantifiable privacy loss

Differential privacy, unlike older privacy notions, introduces a numeric parameter that can be adjusted. However, the significance of this parameter differs significantly. Consider the case of k-anonymity, where the parameter k signifies that each record in the output dataset resembles at least k-1 other records. However, the value of k alone does not provide a clear indication of the level of privacy protection afforded.

The relationship between the value of k and the actual privacy of the dataset is tenuous at best. Selecting an appropriate value for k is often subjective and lacks a formal justification. This issue becomes even more pronounced with other conventional privacy definitions.

In contrast, differential privacy offers a more meaningful approach. It allows for the quantification of the maximum potential information gained by an attacker. The corresponding parameter, denoted as ε, enables formal statements to be made. For instance, if ε = 1.1, it can be stated that "an attacker who initially believes their target is present in the dataset with a probability of 50% can increase their level of certainty to no more than 75%." While determining the precise value of ε may pose challenges, it can be interpreted and reasoned about in a formal manner.
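The 50% to 75% statement can be checked with the standard prior-to-posterior bound implied by ε-differential privacy; the short sketch below is an illustrative calculation added here (it is not part of the thesis).

```python
import math

def max_posterior(prior: float, eps: float) -> float:
    # Upper bound on the attacker's posterior belief after observing an
    # eps-differentially private output, given a prior belief `prior`.
    return math.exp(eps) * prior / (math.exp(eps) * prior + (1.0 - prior))

print(round(max_posterior(prior=0.5, eps=1.1), 3))  # ~0.75, matching the quoted example
```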


Together, these characteristics make differential privacy significantly stronger than all preceding definitions, particularly in terms of attack modeling and the ability to quantify privacy guarantees.

1.3 Multiple mechanisms composition

Consider a scenario where you possess a dataset that you intend to share with two trusted individuals, Alice and Bob, in an anonymized form. While you trust both Alice and Bob equally, they have distinct interests in the data, prompting you to provide them with different versions of the anonymized dataset, each adhering to the same privacy definition.

However, a potential concern arises if Alice and Bob conspire and compare the data they received. In most privacy definitions, combining two k-anonymous versions of the same data does not preserve k-anonymity. Consequently, the union of their anonymized versions may no longer maintain the desired level of anonymity. This collaboration between Alice and Bob could potentially lead to the re-identification of individuals or even the complete reconstruction of the original data, which is a significant privacy risk.

Differential privacy, on the other hand, avoids this vulnerability. Suppose you provided differentially private data to both Alice and Bob, utilizing a privacy parameter of ε for each instance. If they conspire and combine their data, the resulting dataset remains protected by differential privacy. However, the level of privacy becomes weaker, with the privacy parameter becoming 2ε. While some information may still be gained through their collaboration, the extent of this information gain can be quantified. This property is known as composition.


maintain control over the level of risk as new use cases emerge and processes evolve, providing a robust framework for privacy management

2 The promise (and not promised) of Differential Privacy

Adapted from [6]: Differential Privacy: A Primer for a Non-technical Audience (2017) by Professor Kobbi Nissim and his team.

2.1 The promise

The case: “Researchers selected a sample of individuals to participate in a survey exploring the relationship between socioeconomic status and medical outcomes across several U.S. cities. Individual respondents were asked to complete a questionnaire covering topics such as where they live, their finances, and their medical history. One of the participants, John, is aware that individuals have been re-identified in previous releases of de-identified data and is concerned that personal information he provides about himself, such as his HIV status or annual income, could one day be revealed in de-identified data released from this study. If leaked, the personal information John provides in response to the questionnaire used in this study could lead to an increase in his life insurance premium or an adverse decision on a mortgage application he submits in the future.”

Analysis: Differential privacy aims to protect John’s privacy in the real-world scenario in a way that mimics the privacy protection he is afforded in his opt-out scenario. Accordingly, what can be learned about John from a differentially private computation is (essentially) limited to what could be learned about him from everyone else’s data without his own data being included in the computation. Crucially, this very same guarantee is made not only with respect to John, but also with respect to every other individual contributing his or her information to the analysis! A more precise description of the differential privacy guarantee requires the use of formal mathematical language, as well as technical concepts and


precise definition, this document offers a few illustrative examples to discuss various aspects of differential privacy in a way we hope is intuitive and accessible

2.2 The not promise

The case: “Suppose Alice is a friend of John’s and possesses some knowledge about him, such as that he regularly consumes several glasses of red wine with dinner. Alice later learns of a medical research study that found a positive correlation between drinking red wine and the occurrence of a certain type of cancer. She might therefore conclude, based on the results of this study and her prior knowledge of John’s drinking habits, that he has a heightened risk of developing cancer.”

Analysis: It may seem at first that the publication of the results from the medical research study enabled a privacy breach by Alice. After all, learning about the study’s findings helped her infer new information about John, i.e., his elevated cancer risk. However, notice that Alice would be able to infer this information about John even if John had not participated in the medical study; it is a risk that exists in both John’s opt-out scenario and the real-world scenario. In other words, this risk applies to everyone, regardless of whether they contribute personal data to the study or not.

2.3 Conclusion

While Differential Privacy is effective in safeguarding privacy when publishing datasets, it is crucial to acknowledge that certain inference scenarios carry inherent risks even in the absence of the published dataset. The scope of Differential Privacy does not encompass this realm. Differential Privacy only bounds how much an attacker can change their posterior beliefs about the intended target as a result of that target's data being included in the published dataset; inferences that could be drawn with or without access to the dataset remain possible.


3 Formal definition of Differential Privacy

3.1 Terms and notations

Knowledge adapted from [12], [20], [23], [25]

Data curator. A data curator manages the collected data throughout its life cycle. This can include data sanitization, annotation, publication, and presentation. The goal is to ensure that the data can be reliably reused and preserved. In DP, the data curator is responsible for ensuring that the privacy of no individual represented in the data is violated.

Adversary. The adversary represents a data analyst who is interested in finding out (sensitive) information about the individuals in the dataset. In the context of DP, even the legitimate user of the database is referred to as an adversary, as his analyses can damage the individuals' privacy.

L1-norm. The L1-norm of a database $D$ is denoted by $\|D\|_1$. It measures the size of $D$ (e.g., the number of records it contains) and can be defined as:

$$\|D\|_1 = \sum_{i=1}^{|\chi|} |d_i|$$

L1-distance. The L1-distance between two databases $D_1$ and $D_2$ is $\|D_1 - D_2\|_1$. It measures how many records differ between the two databases and can be defined as:

$$\|D_1 - D_2\|_1 = \sum_{i=1}^{|\chi|} |d_{1,i} - d_{2,i}|$$

Neighboring databases. Two databases $D_1$ and $D_2$ are called neighboring if they differ in at most one element. This can be expressed as:

$$\|D_1 - D_2\|_1 \le 1$$
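A small numeric sketch of the terms above (hypothetical histograms, added for illustration): databases are viewed as histograms over the data universe χ, so the L1-norm counts records and the L1-distance counts how many records differ.

```python
import numpy as np

# Counts per element of the data universe chi (histogram representation).
D1 = np.array([2, 0, 1, 3])
D2 = np.array([2, 1, 1, 3])   # one additional record in the second bin

l1_norm = int(np.abs(D1).sum())            # ||D1||_1 = 6 records in total
l1_distance = int(np.abs(D1 - D2).sum())   # ||D1 - D2||_1 = 1
print(l1_norm, l1_distance, l1_distance <= 1)  # D1 and D2 are neighboring databases
```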

Mechanism. Differential Privacy itself is only an abstract concept (or principle); a mechanism is a concrete randomized algorithm that satisfies it.


3.2 Randomized algorithm

Knowledge adapted from [12], [20], [23], [25]

Differential privacy is a property of some randomized algorithms, so we will start by defining what a randomized algorithm is. We will mostly focus on randomized algorithms whose probability space is discrete. To formalize this, we will use the notion of a probability simplex.

Definition 3.1 (Probability simplex). Given a discrete set $B$, the probability simplex over $B$, denoted $\Delta(B)$, is the set:

$$\Delta(B) = \left\{ x \in \mathbb{R}^{|B|} : \forall i,\ x_i \ge 0,\ \text{and } \sum_{i=1}^{|B|} x_i = 1 \right\}$$

Definition 3.2 (Randomized algorithm). A randomized algorithm $M$ with domain $A$ and range $B$ is an algorithm associated with a total map $M: A \to \Delta(B)$. On input $a \in A$, the algorithm $M$ outputs $M(a) = b$ with probability $(M(a))_b$ for each $b \in B$. The probability space is over the coin flips of the algorithm $M$.

The definition clarifies that a randomized algorithm is an algorithm that operates deterministically on two inputs: the dataset and a string of random bits. It is important to note that we will soon encounter the definition of differential privacy, which concerns the probability associated with the randomness of the internal algorithm while keeping the dataset fixed. In this context, the crucial aspect of the definition is that the probability space is associated with the coin flips of the algorithm $M$, highlighting its role as the source of randomness.

3.3 𝜺-differential privacy


Definition 3.3 ($\varepsilon$-differential privacy). Let $\varepsilon > 0$. A randomized function $M$ is $\varepsilon$-differentially private if for all neighboring input datasets $D_1$ and $D_2$ differing on at most one element, and for all $S \subseteq \mathrm{Range}(M)$, we have:

$$\Pr[M(D_1) \in S] \le e^{\varepsilon} \cdot \Pr[M(D_2) \in S]$$

where the probability is taken over the coin tosses of $M$.

Since $D_1$ and $D_2$ can be interchanged, the definition also implies a lower bound:

$$\Pr[M(D_1) \in S] \ge e^{-\varepsilon} \cdot \Pr[M(D_2) \in S]$$

Using the natural logarithm, both constraints can be written as:

$$-\varepsilon \le \ln\!\left(\frac{\Pr[M(D_1) \in S]}{\Pr[M(D_2) \in S]}\right) \le \varepsilon$$

Working with $e^{\varepsilon}$, and thus with logarithmic probabilities, also has other advantages in practical applications on computers (adapted from [25]):

• Computation speed: The product of two probabilities corresponds to an addition in logarithmic space, and multiplication is computationally more expensive than addition.

• Accuracy: Using logarithmic probabilities improves numerical stability when probabilities are very small, because of the way in which computers approximate real numbers. With normal probabilities, more rounding errors are produced.

• Simplicity: Many probability distributions, especially the ones from which the random noise is drawn, have exponential form. Taking the logarithm of these distributions eliminates the exponential function, making it possible to calculate with the exponent only.
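As a minimal, self-contained illustration of Definition 3.3 (added here; it is not one of the thesis mechanisms), binary randomized response reports the true bit with probability e^ε/(1+e^ε), and the worst-case ratio of output probabilities over two inputs differing in that single bit is exactly e^ε.

```python
import math
import random

def randomized_response(true_bit: int, eps: float) -> int:
    # Answer honestly with probability e^eps / (1 + e^eps), otherwise flip the bit.
    p_truth = math.exp(eps) / (1.0 + math.exp(eps))
    return true_bit if random.random() < p_truth else 1 - true_bit

eps = 1.0
p_truth = math.exp(eps) / (1.0 + math.exp(eps))
# Pr[M(1) = 1] / Pr[M(0) = 1] = p_truth / (1 - p_truth) = e^eps
print(p_truth / (1.0 - p_truth), math.exp(eps))
```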


3.4 (ε, δ) differential privacy

Knowledge adapted from [12], [20], [23], [25].

$\varepsilon$-DP places a high privacy requirement on mechanisms. But since adding too much noise to the original data can limit the amount of information drastically, several weakened versions of DP have been proposed. One of the most popular versions is $(\varepsilon, \delta)$-Differential Privacy.

Definition 3.4 ($(\varepsilon, \delta)$-differential privacy). A randomized function $M$ is $(\varepsilon, \delta)$-differentially private if for all neighboring input datasets $D_1$ and $D_2$ differing on at most one element, and for all $S \subseteq \mathrm{Range}(M)$, we have:

$$\Pr[M(D_1) \in S] \le e^{\varepsilon} \cdot \Pr[M(D_2) \in S] + \delta$$

In this case, the two parameters $\varepsilon$ and $\delta$ control the level of privacy.

The strongest version of differential privacy, in which $\delta = 0$, is known as pure differential privacy, while the more general case where $\delta > 0$ is known as approximate differential privacy. The meaning of $\delta$ can be seen as the probability of a data leak. Sometimes this raises concerns regarding its presence in a privacy definition. However, there is no silver bullet for all privacy problems, and everything in life needs risk management. Therefore, it is reasonable to consider $\delta$ as a real-life risk factor and plan appropriate scenarios for it.

This shows that $\delta$ should be small. Intuitively, that makes sense because, with a large $\delta$, even if $\varepsilon = 0$ (perfect secrecy), a mechanism that is $(0, \delta)$-differentially private would breach privacy with high probability. A common heuristic for choosing $\delta$ for a database with $n$ records is $\delta < \frac{1}{n}$, because with an $(\varepsilon, \delta)$-DP mechanism, the privacy of each record in the database may be given away with probability $\delta$. Note, however, that $(\varepsilon, \delta)$-DP is much weaker than $(\varepsilon, 0)$-DP, even when $\delta$ is very small relative to the size $n$ of the database.


4 Important concepts of Differential Privacy

4.1 The sensitivity

Knowledge adapted from [12], [20], [23], [25]

This property facilitates the estimation of the required level of noise perturbation in a differential privacy mechanism: it quantifies the extent to which the output of a function will change when its input is altered.

4.1.1 Global sensitivity

Global sensitivity captures the maximum difference between the query results on any two neighboring databases and is the quantity used in the differentially private mechanisms. The formal definition is:

$$GS(f) = \max_{D_1, D_2} \|f(D_1) - f(D_2)\|_1$$

Consider the COUNT function as an example, where $GS(f) = 1$. This is because the addition of a row to the dataset can, at most, increase the query's output by 1: either the new row possesses the desired attribute, leading to a count increment of 1, or it lacks the attribute, keeping the count the same (the count may correspondingly decrease when a row is removed).
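The sketch below (hypothetical data, added for illustration) shows how global sensitivity is typically used in practice: a COUNT query with GS(f) = 1 can be made ε-differentially private by adding Laplace noise with scale GS(f)/ε, in the spirit of the Laplace mechanism covered in Section 5.2.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_count(data, predicate, eps: float, sensitivity: float = 1.0) -> float:
    # COUNT has global sensitivity 1, so Laplace noise with scale 1/eps suffices.
    true_count = sum(1 for x in data if predicate(x))
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / eps)

ages = [23, 37, 45, 29, 51]
print(dp_count(ages, lambda a: a >= 30, eps=0.5))  # noisy count of people aged 30 or over
```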

4.1.2 Local sensitivity

Thus far, we have exclusively examined one metric of sensitivity, namely global sensitivity. Our definition of global sensitivity considers any two neighboring datasets. However, this approach may be overly pessimistic, because when running our differential privacy mechanisms on a real dataset it would be more relevant to consider the neighbors of that specific dataset.

This is where the concept of local sensitivity comes into play. Local sensitivity takes one dataset as the actual dataset being queried and considers all of its neighbors.


Formally, the local sensitivity is defined as:

$$LS(f, D_1) = \max_{D_2} \|f(D_1) - f(D_2)\|_1$$

where the maximum is taken over all neighbors $D_2$ of the queried dataset $D_1$.

Local sensitivity offers a valuable means of establishing finite bounds on the sensitivity of certain functions, especially when it is challenging to bound their global sensitivity. The mean function serves as an illustrative example.

We compute differentially private means by dividing the query into two separate queries: a differentially private sum (the numerator) and a differentially private count (the denominator). Through sequential composition and post-processing (which will be covered later), the resulting quotient ensures differential privacy.

We use this approach because the extent to which the output of a mean query may change when a row is added to or removed from the dataset is contingent upon the dataset's size. If we were to bound the global sensitivity of a mean query, we would need to assume the worst-case scenario: a dataset of size 1. However, this assumption is overly pessimistic for large datasets, and thus the "noisy sum over noisy count" methodology is significantly more advantageous.
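A minimal sketch of the "noisy sum over noisy count" approach described above (hypothetical data and clipping bound, added for illustration): each sub-query receives half of the budget, and by sequential composition and post-processing the released quotient is ε-differentially private.

```python
import numpy as np

rng = np.random.default_rng(1)

def dp_mean(values, upper: float, eps: float) -> float:
    clipped = np.clip(values, 0.0, upper)     # one record changes the sum by at most `upper`
    eps_half = eps / 2.0                      # split the privacy budget between the two sub-queries
    noisy_sum = clipped.sum() + rng.laplace(scale=upper / eps_half)
    noisy_count = len(clipped) + rng.laplace(scale=1.0 / eps_half)
    return noisy_sum / max(noisy_count, 1.0)  # post-processing keeps the result eps-DP

purchases = np.array([12.0, 3.5, 7.0, 25.0, 40.0])
print(dp_mean(purchases, upper=30.0, eps=1.0))
```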

4.2 Privacy composition

Knowledge adapted from [12], [20], [23], [25]

Any approach to privacy should address the issue of composition. Composition is the execution of several queries on the same dataset. These queries can be executed sequentially on the same data or in parallel on disjoint subsets of the data, as illustrated in Figure 2.


Figure 2 Visualize how Sequential and Parallel Composition works - taken from [25]

4.2.1 Sequential composition

This helps bound the total privacy cost of releasing multiple results of differentially private mechanisms on the same input data. Formally, the sequential composition theorem for differential privacy states:

• If $K_1(x)$ satisfies $\varepsilon_1$-differential privacy,

• and $K_2(x)$ satisfies $\varepsilon_2$-differential privacy,

• then $G(x) = (K_1(x), K_2(x))$, which releases both results, satisfies $(\varepsilon_1 + \varepsilon_2)$-differential privacy.

Sequential composition is a vital property of differential privacy because it enables the design of algorithms that consult the data more than once. Sequential composition is also important when multiple separate analyses are performed on a single dataset, since it allows individuals to bound the total privacy cost they incur by participating in all these analyses. The bound on privacy cost given by sequential composition is an upper bound: the actual privacy cost of two differentially private releases may be smaller than this, but never larger.
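The sketch below (hypothetical data, added for illustration) shows sequential composition in code: two Laplace count queries on the same dataset with budgets ε1 and ε2 together consume a total budget of ε1 + ε2.

```python
import numpy as np

rng = np.random.default_rng(2)
data = np.array([1, 0, 1, 1, 0, 1])   # 1 = the record has the sensitive attribute

def laplace_count(data, eps: float) -> float:
    # COUNT has sensitivity 1, so Laplace noise with scale 1/eps gives eps-DP.
    return float(data.sum() + rng.laplace(scale=1.0 / eps))

eps1, eps2 = 0.3, 0.7
release_1 = laplace_count(data, eps1)
release_2 = laplace_count(data, eps2)
# Publishing both releases satisfies (eps1 + eps2)-differential privacy, i.e. a total budget of 1.0.
print(release_1, release_2)
```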

4.2.2 Parallel composition
