
LẠI TRUNG MINH ĐỨC

PRESERVING PRIVACY FOR PUBLISHING-TIME-SERIES DATA

WITH DIFFERENTIAL PRIVACY

Major: COMPUTER SCIENCE
Major code: 8480101

MASTER’S THESIS

HO CHI MINH CITY, July 2023


THIS THESIS IS COMPLETED AT

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY – VNU-HCM

Supervisor 1: Assoc Professor DANG TRAN KHANH

Supervisor 2: PhD LE LAM SON

Examiner 1: Assoc Professor TRAN TUAN DANG

Examiner 2: PhD DANG TRAN TRI

This master's thesis is defended at Ho Chi Minh City University of Technology, VNU-HCM, on 10 July 2023.

Master’s Thesis Committee:

1 Chairman - Assoc Professor TRAN MINH QUANG
2 Secretary - PhD NGUYEN THI AI THAO

3 Examiner 1 - Assoc Professor TRAN TUAN DANG
4 Examiner 2 - PhD DANG TRAN TRI

5 Commissioner - PhD TRUONG THI AI MINH

Approval of the Chairman of the Master's Thesis Committee and the Dean of the Faculty of Computer Science and Engineering after the thesis has been corrected (if any).

CHAIRMAN OF THESIS COMMITTEE

HEAD OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING


VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness

THE TASK SHEET OF MASTER’S THESIS

Full name: LẠI TRUNG MINH ĐỨC
Student ID: 2070686

Date of birth: 24 May 1996
Place of birth: Ho Chi Minh City
Major: Computer Science
Major ID: 8480101

I THESIS TITLE:

Preserving Privacy for Publishing Time-series Data with Differential Privacy (Duy trì quyền riêng tư cho thời gian xuất bản dữ liệu chuỗi với quyền riêng tư khác biệt)

II TASKS AND CONTENTS:

- Implementing the algorithms of Differential Privacy on Time series data

- Comparing those algorithms with data utility metrics

- Finding the data characteristics to choose the best algorithms
- Write up the report

(cont.)

10 weeks

W15 to W16: Finalize the solution package; finalize the document; prepare the presentation (2 weeks)

Thesis Submission – June 2023


III THESIS START DAY:

(According to the decision on assignment of Master’s thesis)

05 September 2022

IV THESIS COMPLETION DAY:

(According to the decision on the assignment of the Master’s thesis)

12 June 2023

V SUPERVISOR (Please fill in the supervisor’s full name and academic rank)

Assoc Prof DANG TRAN KHANH – PGS TS ĐẶNG TRẦN KHÁNH

PhD LE LAM SON – TS LÊ LAM SƠN

Ho Chi Minh City, 09 June 2023

SUPERVISOR 1 (Full name and signature)

SUPERVISOR 2 (Full name and signature)

Assoc Prof DANG TRAN KHANH PhD LE LAM SON

CHAIR OF PROGRAM COMMITTEE (Full name and signature)

DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING (Full name and signature)

Note: Student must pin this task sheet as the first page of the Master's Thesis booklet.


I would also like to extend my sincere appreciation to the team at Unilever Vietnam, particularly Mr ERIC FRANCIS CHEN – Head of UniOps and Data & Analytics SEA&I, Mr IAN LOW - my line manager, and the awesome Unilever Data & Analytics Vietnam team. Your willingness to assist and insightful discussions have significantly contributed to the successful completion of this thesis.

Furthermore, I want to acknowledge my friends, classmates, doctors, and psychologists for your mental support and contributions. Your motivation, discussions, and treatments with diverse perspectives have supported me to live positively. Special appreciation goes to Ms HOA NINH, Ms LINH PHAM, Ms XUAN NGUYEN, and Mr RYAN NGUYEN for your huge support and encouragement throughout this journey.

Lastly, I am indebted to my family: my mom THAI LINH PHAM and my sister MARY PHUC LAI, for your unconditional love and constant encouragement. Your sacrifices and understanding have been the bedrock of my achievements. I am profoundly grateful for your presence and support.

To all those mentioned above, as well as those who have contributed in immeasurable ways, I offer my sincerest thanks. Your efforts, belief, and contributions have made this thesis possible.


ABSTRACT

This thesis explores the crucial domain of data privacy, encompassing the rights of individuals to maintain control over the collection, usage, sharing, and storage of their personal data. Within the realm of personal data, time-series data poses distinct challenges and sensitivities when it comes to privacy protection. Time-series data comprises information with temporal attributes that can unveil patterns, trends, and behaviors of individuals or groups over time, and carries inherent risks in terms of privacy breaches.

The primary objectives of this thesis are as follows: first, to review traditional methods of privacy-preserving data publishing, with a specific focus on efforts made for protecting time-series data; second, to gain a comprehensive understanding of the theories and principles of Differential Privacy, a promising approach for privacy preservation; third, to explore notable mechanisms within Differential Privacy that are applicable to time-series data; fourth, to investigate and address privacy challenges in data partnerships through the integration of Differential Privacy and other relevant techniques; and finally, to develop a process for the application of privacy techniques within the context of business collaborations

The contribution of this thesis is twofold. Firstly, it aims to make the concept of Differential Privacy more accessible and comprehensible, particularly for non-academic and corporate audiences who may not have a deep technical background. By presenting Differential Privacy in a clear and straightforward manner, this research facilitates its adoption and implementation in real-world scenarios. Secondly, this thesis proposes a guideline for the practical application and evaluation of Differential Privacy on time-series data, specifically within the context of data collaboration among multiple parties. The guideline serves as a valuable resource for organizations seeking to protect privacy while engaging in collaborative data initiatives.


TÓM TẮT LUẬN VĂN

Luận văn này nghiên cứu về lĩnh vực quan trọng của quyền riêng tư dữ liệu, bao gồm quyền của cá nhân để duy trì sự kiểm soát về việc thu thập, sử dụng, chia sẻ và lưu trữ dữ liệu cá nhân của họ. Trong lĩnh vực dữ liệu cá nhân, dữ liệu chuỗi thời gian đặt ra những thách thức và nhạy cảm riêng biệt khi đến việc bảo vệ quyền riêng tư. Dữ liệu chuỗi thời gian bao gồm thông tin có thuộc tính thời gian có thể tiết lộ các mẫu, xu hướng và hành vi của cá nhân hoặc nhóm qua thời gian, và mang theo các rủi ro về việc vi phạm quyền riêng tư.

Các mục tiêu chính của luận văn này như sau: thứ nhất, xem xét các phương pháp truyền thống về việc xuất bản dữ liệu bảo vệ quyền riêng tư, với tập trung đặc biệt vào các nỗ lực để bảo vệ dữ liệu chuỗi thời gian; thứ hai, để hiểu rõ về các lý thuyết và nguyên tắc của Sự khác biệt về Quyền riêng tư, một phương pháp hứa hẹn để bảo vệ quyền riêng tư; thứ ba, khám phá các cơ chế đáng chú ý trong Sự khác biệt về Quyền riêng tư mà có thể áp dụng cho dữ liệu chuỗi thời gian; thứ tư, điều tra và đối mặt với những thách thức về quyền riêng tư trong các đối tác dữ liệu thông qua việc tích hợp Sự khác biệt về Quyền riêng tư và các kỹ thuật liên quan khác; và cuối cùng, phát triển quy trình để áp dụng các kỹ thuật bảo vệ quyền riêng tư trong bối cảnh hợp tác kinh doanh

Đóng góp của luận văn này là hai phần. Thứ nhất, nó nhằm mục tiêu làm cho khái niệm về Sự khác biệt về Quyền riêng tư trở nên dễ tiếp cận và dễ hiểu, đặc biệt là đối với các đối tượng không học thuật và doanh nghiệp có thể không có nền tảng kỹ thuật sâu. Bằng cách trình bày Sự khác biệt về Quyền riêng tư một cách rõ ràng và dễ hiểu, nghiên cứu này hỗ trợ việc áp dụng và thực hiện nó trong các tình huống thực tế. Thứ hai, luận văn này đề xuất một hướng dẫn cho việc áp dụng thực tiễn và đánh giá Sự khác biệt về Quyền riêng tư trong dữ liệu chuỗi thời gian, đặc biệt trong bối cảnh hợp tác dữ liệu giữa nhiều bên.


THE COMMITMENT OF THESIS’ AUTHOR

I hereby declare that this master's thesis is my own original work and has not been submitted before to any institution for assessment purposes.

Further, I have acknowledged all sources used and have cited these in the reference section.

LAI TRUNG MINH DUC

Date


TABLE OF CONTENTS

CHAPTER 1: OVERVIEW OF THE THESIS 1

1 Background and Context 1

2 Data Publishing and Privacy Preserving Data Publishing 2

3 Challenges of Privacy Preserving Data Publishing (PPDP) for Time-series data 3

4 Differential Privacy as a powerful player 4

5 Thesis objectives 5

6 Thesis contributions 5

7 Thesis structure 5

CHAPTER 2: PRIVACY MODELS RESEARCH 7

1 Attack models and notable privacy models 7

1.1 Record linkage attack and k-Anonymity privacy model 7

1.2 Attribute linkage attack and l-diversity and t-closeness privacy model 8

1.3 Table linkage and δ-presence privacy model 10

1.4 Probabilistic linkage and Differential Privacy model 11

2 Summary 12

CHAPTER 3: THE INVESTIGATION ON DIFFERENTIAL PRIVACY 14

1 The need for Differential Privacy principle 14

1.1 No need to model the attack model in detail 14

1.2 Quantifiable privacy loss 15

1.3 Multiple mechanisms composition 16

2 The promise (and not promised) of Differential Privacy 17


2.1 The promise 17

2.2 The not promise 18

2.3 Conclusion 18

3 Formal definition of Differential Privacy 18

3.1 Terms and notations 19

5 Foundation mechanisms of Differential Privacy 26

5.1 Local Differential Privacy and Global Differential Privacy 26

5.2 Laplace mechanism 27

5.3 Exponential mechanism 29

6 Notable mechanisms for Time-series data 30

6.1 Laplace mechanism (LPA – Laplace Perturbation Algorithm) 30

6.2 Discrete Fourier Transform (DFT) with Laplace mechanism (FPA – Fourier Perturbation Algorithm) 31

6.3 Temporal perturbation mechanism 32

6.4 STL-DP – Perturbed time-series by applying DFT with Laplace mechanism on trends and seasonality 37

CHAPTER 4: EXPERIMENT DESIGNS 38


2.1 Revisit the GDPR related terms for data sharing 41

2.2 Potential attack models and countermeasures 44

2.3 Define scope of work 45

3 Evaluation methodology 46

3.1 Data utility 46

3.2 Privacy metrics 46

3.3 Evaluation process 47

4 Privacy protection proposal 47

CHAPTER 5: EXPERIMENT IMPLEMENTATIONS 49

1 Experiment preparation 49

2 Data exploration analysis (EDA) 49

2.1 Data overview 49

2.2 Descriptive Analysis 50

2.3 Maximum data domain estimation 52

3 Differential Privacy mechanisms implementation 54

4 Data perturbed evaluation 61

4.1 RFM Analysis for dataset 61


4.2 Forecasting trendline at categories, consumer-groups, and store level 65

4.3 Privacy evaluation 73

4.4 Recommendation for using Differential Privacy in data partnership cases 77

CHAPTER 6: CONCLUSION AND FUTURE WORKS 78

REFERENCES 80

APPENDIX 83

Table: Descriptive Statistics table of the data perturbation output 83

Table: Accuracy result of RFM analysis between data perturbations and original data 85

Table: Accuracy (RMSE) result from the forecast of data perturbations and original data 88

Table: Accuracy (RMSE) result of the adjusted forecast version 90


TABLE OF TABLES

Table 1 Original Shopping data table structure 39

Table 2 Shopping Data table structure for the use-case 40

Table 3 Data structure of the synthesis dataset 50

Table 4 Maximum domain value estimation for each Category 53

Table 5 Descriptive Statistics table of the data perturbation output (detail table in the Appendix) 60

Table 6 Accuracy result of RFM analysis between data perturbations and original data (detail in Appendix) 64

Table 7 Accuracy (RMSE) result from the forecast of data perturbations and original data (detail in Appendix) 70

Table 8 Accuracy (RMSE) result of the adjusted forecast version (detail in Appendix) 72

Table 9 Data Utility and Data Privacy improving ratio 74


TABLE OF FIGURES

Figure 1 A fictitious database with hospital data, taken from [14] 8

Figure 2 Visualize how Sequential and Parallel Composition works – taken from [25] 25

Figure 3 Visualize how Local Privacy and Global Privacy works - taken from [25] 27

Figure 4 The Laplace mechanism with multiple scales 28

Figure 5 The visualization process of LPA and DFT (or FPA) - taken from [19] 32

Figure 6 The visualization of Backward Perturbation mechanism (BPA) - taken from [4] 34

Figure 7 The visualization of Forward Perturbation mechanism - taken from [4] 35

Figure 8 The cost analysis table for BPA and FPA method - taken from [4] 36

Figure 9 The pseudo-code for Threshold Mechanism - taken from [4] 36

Figure 10 The process of how STL-DP mechanism works - taken from [27] 37

Figure 11 EDA - Fetch the synthesis dataset and display on Python 49

Figure 12 EDA - Code block and display for dataset descriptive statistics 50

Figure 13 EDA - Code block for dataset aggregation to find TOP 5 huge value consumers 51

Figure 14 EDA - Code block to visualize the purchase pattern over year by week of the dataset (aggregation) 52

Figure 15 Box-plot to visualize the quantity range - support for estimate the maximum data domain 54

Figure 16 Code-block - Import related libraries and initialize Ray multiprocessing 55

Figure 17 Code-block - Implementation of Laplace Perturbation Mechanism on time-series data 56

Figure 18 Code-block - Implementation of Fourier Perturbation Mechanism on time-series data 56

Figure 19 Code-block - Implementation of STL-DP Mechanism on time-series data 57


Figure 20 Code-block - Setup the Epsilon and Maximum data domain values 58

Figure 21 Code-block function to run the data perturbation process 59

Figure 22 Code-block for the multi-processing run of the whole process 60

Figure 23 Code-block to implement RFM Analysis 63

Figure 24 Code-block to measure accuracy of RFM clusters for data perturbation methods and the clusters of original data 64

Figure 25 EDA - Code-block to calculate number of timeseries/groups to conduct the forecast in next section 66

Figure 26 EDA - Code-block to calculate number of timeseries/groups to conduct the forecast in next section - Select only value > 300 days/year 67

Figure 27 Code-block for building data lookup dictionary for RFM bins 67

Figure 28 Code-block for running Simple Linear Regression to forecast, combine with the parameter M lookup table 68

Figure 29 Code-block for RMSE calculation for the forecast output 69

Figure 30 Adjusted version code-block to run Simple Linear Regression on the data perturbations and original data 71

Figure 31 Data Utility and Privacy trade-off (RMSE normal scale) 73

Figure 32 Data Utility and Privacy trade-off (RMSE Log10 scale) 74

Figure 33 EDA - Check the TOP 5 high value consumers in multiple methods (Original data, FPA, tFPA) 75

Figure 34 EDA - Plot Line chart to compare 3 methods and original volume of USER 944 76

Figure 35 EDA - Plot Line chart to compare 3 methods and original volume of USER 944 (by category) 77


CHAPTER 1: OVERVIEW OF THE THESIS

1 Background and Context

Data privacy, also referred to as information privacy or data protection, encompasses the fundamental right of individuals to exert control over the collection, usage, sharing, and storage of their personal data by others. Personal data encompasses any information that identifies or relates to an individual, including but not limited to their name, address, phone number, email, health records, financial transactions, online activities, preferences, and opinions.

Within the realm of personal data, time-series data poses unique challenges and sensitivities in terms of privacy protection. Time-series data contains temporal information that can unveil patterns, trends, and behaviors of individuals or groups over time. Examples of personal time-series data include weblogs, social media posts, GPS traces, health records, among others. While time-series data serves various purposes, such as forecasting, anomaly detection, classification, clustering, and summarization, it also carries inherent risks if privacy breaches occur.

For instance, an attacker could potentially deduce the identity of a user by matching their GPS traces or weblogs with publicly available information. Similarly, the analysis of temporal patterns or correlations within sensor readings or social media posts could lead to the identification of a user. Moreover, the observation of changes or trends over time in health records can reveal an individual's preferences, habits, and activities. These scenarios underscore the significance of safeguarding privacy when dealing with time-series data to mitigate the potential risks associated with unauthorized access or misuse of personal information.

However, in the world of collaboration, there is a real need for data sharing between multiple researchers from multiple corporates to conduct analyses that benefit humankind (e.g., COVID-19 mobility data sharing from Google and Apple). At that stage, the data will be published and shared between the parties, and while anonymization techniques might be applied before publishing, there is still a significant risk of reverting identities. For instance:

• In 2006, AOL released a dataset of 20 million search queries from 650,000 users over a three-month period [1]. The dataset was intended for research purposes and was anonymized by replacing user IDs with random numbers. However, researchers and journalists were able to re-identify some users by analyzing their search queries and linking them with other public information, such as news articles, blogs, or social media profiles. Some users were exposed to embarrassment, harassment, or legal troubles because of their search histories being revealed.

• In 2007, Netflix released a dataset of 100 million movie ratings from 480,000 users (about half the population of Montana) over a six-year period. The dataset was intended for a contest to improve Netflix's recommendation system and was anonymized by removing usernames and other identifying information. However, researchers [2] were able to re-identify some users by comparing their movie ratings with those on another movie website, IMDb. Some users were exposed to privacy breaches as their movie preferences could reveal their political views, sexual orientation, or health conditions.

2 Data Publishing and Privacy Preserving Data Publishing

Data publishing is the process of making data available to the public or other parties for analysis, research, or decision-making. However, data often contains sensitive information about individuals, such as their identities, preferences, health conditions, or financial records. If such data is published without proper protection, it may lead to privacy breaches and harm the individuals' interests or rights.

Privacy-preserving data publishing (PPDP) is a field of study that aims to protect individuals' privacy while maintaining the data's utility for legitimate purposes. PPDP involves applying various techniques to transform or anonymize the data before publishing it, such that the sensitive information is hidden or distorted, but the general patterns or statistics are preserved.


There are different types of data that can be published, such as tabular data (e.g., census records, medical records) or graph data (e.g., social networks, web graphs). Depending on the data type and the privacy requirements, different PPDP techniques can be used. Some of the common PPDP techniques are listed below (adapted from [7] and [16]), and a small illustrative sketch of the first three follows the list:

• Generalization: This technique replaces specific values in the data with more general ones, such as replacing ages with age ranges or zip codes with regions. Generalization reduces the granularity of the data and makes it harder to identify individuals based on their attributes.

• Suppression: This technique removes some values or records from the data entirely, such as deleting names or phone numbers or omitting outliers. Suppression reduces the amount of information in the data and makes it harder to link individuals across different sources.

• Perturbation: This technique adds noise or randomness to the data, such as swapping values among records, adding or subtracting a small amount from numerical values, or flipping bits in binary values. Perturbation alters the original values in the data and makes it harder to infer the true values based on statistical analysis.

• Encryption: This technique transforms the data using cryptographic methods, such as hashing, encryption, or homomorphic encryption. Encryption protects the confidentiality of the data and makes it harder to access the original values without a secret key.
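As a toy illustration of generalization, suppression, and perturbation, the sketch below applies the three techniques to a small fictitious table with pandas. The column names and values are hypothetical and serve only to make the ideas concrete; this is not the dataset or code used in the thesis experiments.

```python
import numpy as np
import pandas as pd

# A fictitious table: one record per person, with an identifier and quasi-identifiers.
df = pd.DataFrame({
    "name":   ["An", "Binh", "Chi", "Dung"],
    "age":    [23, 27, 31, 45],
    "zip":    ["70001", "70002", "71005", "72010"],
    "amount": [120.0, 80.0, 200.0, 150.0],
})

# Generalization: exact ages become age ranges, zip codes become coarse regions.
df["age_range"] = pd.cut(df["age"], bins=[0, 30, 40, 50], labels=["<=30", "31-40", "41-50"])
df["region"] = df["zip"].str[:2]

# Suppression: drop the direct identifier and the fine-grained columns.
published = df.drop(columns=["name", "age", "zip"])

# Perturbation: add small random noise to the numeric attribute.
rng = np.random.default_rng(0)
published["amount"] = published["amount"] + rng.normal(0.0, 10.0, size=len(published))

print(published)
```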

3 Challenges of Privacy Preserving Data Publishing (PPDP) for Time-series data

As noted above, PPDP is a research area that aims to protect the privacy of individuals or organizations while allowing valuable data analysis on published data. Time-series data is a type of data that records the values of certain variables over time, such as sensor readings, stock prices, or health records. PPDP for time-series data poses several challenges, such as (adapted from [7] and [16]):


• Correlation: Time-series data often exhibits strong temporal or spatial correlation, which means that some attributes' values depend on the values of other attributes or previous time points. This can lead to privacy breaches if an adversary can exploit this correlation to infer sensitive information from the published data.

• Dynamics: Time-series data is often updated or streamed continuously, which requires efficient and adaptive PPDP methods that can handle dynamic changes and updates without compromising privacy or utility.

4 Differential Privacy as a powerful player

To address the challenges above, there is a principle that is accepted and explored by many researchers nowadays, called Differential Privacy, invented in 2006 by Professor Cynthia Dwork (adapted from [7], [11], [16], [21], [22]).

Differential privacy is a commitment made by a data holder or curator to a data subject: no matter what other research, data sets, or information sources are available, enabling the use of your data in any study or analysis will have no effect on you, whether positive or negative. At their finest, differentially private database systems may make secret data publicly accessible for effective data analysis without the need for clean data rooms, data usage agreements, or limited views.

Differential privacy guarantees that the same conclusions, such as "smoking causes cancer", will be reached regardless of whether a person opts in or out of the data collection. Particularly, it assures that each output sequence (responses to queries) is "basically" equally likely to occur, regardless of the existence or absence of any individual. Here, probabilities are taken over random selections made by the privacy mechanism (something controlled by the data curator), and the phrase "basically" is captured by the parameter ε.

Differential privacy can be achieved by adding carefully calibrated noise to the original data or to the output of a data analysis function. Differential privacy has been applied to time-series data in various scenarios, such as smart metering, location-based services, or health monitoring.

In this thesis, we will explore Differential Privacy further, from the theory to some proposed solutions for protecting the privacy of time-series data.

• To explore and solve the privacy use-case in the data partnership with Differential Privacy and other techniques, then build the process to apply privacy techniques in business collaboration.

7 Thesis structure

Chapter 1: Background and context (this chapter) – summarizes the purpose of data publishing, the purpose of privacy protection for released data, the challenges of time-series data, and the brief of the Differential Privacy principle.

Chapter 2: Literature review – summarizes and compares privacy attack models with notable algorithms, and briefly analyses their weaknesses and their connection to Differential Privacy.

Chapter 3: Investigation on Differential Privacy – explores the mathematical concepts, theories, foundational mechanisms, and related techniques for time-series data.


Chapter 4: Experiment designs – based on a use-case synthesized from a real-world problem, analyses and proposes a process and solution to protect the privacy of individuals in the data partnership process.

Chapter 5: Experiment implementations – based on the proposal in Chapter 4, utilizes knowledge of Differential Privacy techniques and Data Analytics to implement the related requirements.

Chapter 6: Conclusion and Future works


CHAPTER 2: PRIVACY MODELS RESEARCH

1 Attack models and notable privacy models

1.1 Record linkage attack and k-Anonymity privacy model

Record linkage attacks (adapted from [7], [9], [16]) are privacy attacks that exploit the linkage of records across different data sources to reveal sensitive information about individuals. These attacks typically involve identifying records in different datasets that correspond to the same individual and then linking the records together to obtain additional information. Usually, the linkage attack happens based on quasi-identifiers (QIDs), such as age, gender, or ZIP code, to link records that correspond to the same individual across different datasets.

From the research of Sweeney [9], based on just ZIP code, date of birth, and gender, there is an 87% chance of uniquely identifying individuals in the US, which is a very high probability.

To counter this attack, the k-anonymity method has been proposed by Sweeney [9]: if one record in the table has some QID value, then at least k-1 other records also have that QID value, which means each record is indistinguishable from at least k-1 other records with respect to the QID. Therefore, the probability of linking a victim to a specific record through the QID is at most 1/k.

k-anonymity uses data generalization and suppression techniques to ensure that every combination of values of the quasi-identifiers can be indistinguishably matched to at least k individuals in one equivalence class.
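As a rough illustration of how this property can be checked, the sketch below groups a table by its quasi-identifier columns and reports the size of the smallest equivalence class; the table is k-anonymous exactly when that minimum is at least k. The column names and the tiny table are hypothetical and are not part of the thesis implementation.

```python
import pandas as pd

def k_anonymity_level(df: pd.DataFrame, qid_cols: list[str]) -> int:
    """Return the size of the smallest equivalence class over the QID columns."""
    return int(df.groupby(qid_cols).size().min())

# Hypothetical released table: age range and region are the quasi-identifiers.
released = pd.DataFrame({
    "age_range": ["20-30", "20-30", "20-30", "31-40", "31-40"],
    "region":    ["70",    "70",    "70",    "71",    "71"],
    "disease":   ["Flu",   "HIV",   "Flu",   "Cancer", "Flu"],
})

k = k_anonymity_level(released, ["age_range", "region"])
print(f"The table is {k}-anonymous with respect to (age_range, region)")
```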

While k-anonymity can prevent identity disclosure, so that no individual's exact record in the database can be identified, it does not prevent attribute disclosure. Attribute disclosure occurs if it is possible, even without finding the exact record of an individual, to reveal his sensitive attributes.


Figure 1 A fictitious database with hospital data, taken from [14]

1.2 Attribute linkage attack and l-diversity and t-closeness privacy model

The attribute linkage attack (adapted from [7], [14], [16]) is a method that enables an attacker to infer sensitive values of a target victim from the published data, using the set of sensitive values associated with the group that the victim belongs to, even when the attacker cannot accurately identify the victim's record. k-anonymized databases suffer from attribute disclosure through two attack forms: the homogeneity attack and the background knowledge attack. The homogeneity attack is possible when the sensitive attributes lack diversity, while the background knowledge attack relies on having partial knowledge of an individual or a distribution of sensitive and non-sensitive attributes in a population.

In 2006, Machanavajjhala [14] proposed the l-diversity method to address the attribute linkage attack. The core principle of l-diversity is to prevent homogeneity of sensitive attributes within equivalence classes. Specifically, a database is considered l-diverse if there exist at least l distinct values of the sensitive attribute in every equivalence class. To break l-diversity, an adversary seeking to breach the privacy of an individual within the database would need to possess (l-1) pieces of background knowledge.
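Continuing the same kind of check, (distinct) l-diversity can be verified by counting the distinct sensitive values inside each equivalence class. The sketch below illustrates only this simplest variant of the definition on the same hypothetical table as above.

```python
import pandas as pd

def l_diversity_level(df: pd.DataFrame, qid_cols: list[str], sensitive_col: str) -> int:
    """Return the smallest number of distinct sensitive values over all equivalence classes."""
    return int(df.groupby(qid_cols)[sensitive_col].nunique().min())

released = pd.DataFrame({
    "age_range": ["20-30", "20-30", "20-30", "31-40", "31-40"],
    "region":    ["70",    "70",    "70",    "71",    "71"],
    "disease":   ["Flu",   "HIV",   "Flu",   "Cancer", "Flu"],
})

l = l_diversity_level(released, ["age_range", "region"], "disease")
print(f"The table is {l}-diverse with respect to the sensitive attribute 'disease'")
```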


However, a drawback of l-diversity is that it can be too difficult or unnecessary to achieve. Imagine a scenario with 10,000 data records with a sensitive attribute. Let the class be A for 99% of the individuals and B for the remaining 1%. Both possible values have different sensitivities: one would not mind being tested negative; thus, in an equivalence class that only contains A-entries for the sensitive attribute, l-diversity is unnecessary. But to achieve 2-diversity, at most 10,000 * 0.01 = 100 equivalence classes could exist, since each class must contain at least one of the 100 B-records, which would lead to a large information loss.

Both l-diversity and k-anonymity are insufficient to prevent attribute disclosure, as there are two conceivable attacks: the skewness attack and the similarity attack. If the overall distribution of sensitive attributes is skewed, as in the example of 99% to 1%, a skewness attack is conceivable. Imagine that one equivalence class contained the same amount of positive and negative records. The database would still be 2-diverse, but the risk of anyone in that class being considered positive would increase from 1% to 50%, which could compromise the privacy of individuals. Similarity attacks are possible if the sensitive attribute values within an equivalence class are distinct but semantically similar. Consider once more the hospital example and the equivalence class with the sensitive attribute values pulmonary cancer, breast cancer, and brain cancer. These are distinct values, but they all represent a form of cancer; therefore, a person in this equivalence class has cancer.

To overcome the similarity and skewness attacks on l-diversity, Li et al. [15] introduced t-closeness, which is a privacy model that is more stringent than l-diversity. It requires that the distribution of the sensitive attribute values in each equivalence class be t-close to the distribution of the sensitive attribute values in the entire dataset. This ensures that an attacker cannot infer the value of a sensitive attribute for an individual record by simply looking at the distribution of the sensitive attribute values in the equivalence class that the record belongs to.

In other words, t-closeness ensures that the distribution of the sensitive attribute values in each equivalence class is not significantly different from the distribution of the sensitive attribute values in the entire dataset. This makes it more difficult for an attacker to infer the value of a sensitive attribute for an individual record.

Although t-closeness is better than l-diversity in terms of protection, it has big drawbacks due to its expensive computation and the large loss in data utility. Therefore, in the era of big data, l-diversity and t-closeness are hardly mentioned as best practices.

1.3 Table linkage and δ-presence privacy model

The concepts of record linkage and attribute linkage typically assume that the attacker is already aware of the presence of the victim's record in the published table. However, this may not always be the case. In some situations, the mere existence or absence of the victim's record in the dataset can reveal sensitive information. For instance, consider a scenario where a medical facility releases a data table concerning a specific illness. The mere presence of the victim's record in the database can have detrimental consequences. Table linkage occurs when an attacker can reliably deduce either the presence or absence of the victim's record in the published table.

In 2007, Nergiz (adapted from [7], [16]) proposed a solution to the table linkage problem called the delta-presence (δ-presence) approach. The fundamental idea is to constrain or "bound" the probability of inferring the presence of each potential victim's data within a designated range referred to as δ. The δ-presence approach offers the advantage of indirectly preventing record and attribute linkages. If an attacker has a confidence level of at most δ% regarding the presence of the victim's record in the leaked table, then the likelihood of successfully linking to their record and sensitive attributes is limited to at most δ%. Therefore, δ-presence can indirectly mitigate the risks associated with record and attribute connections.

While the δ-presence approach provides a "safe" privacy model, it relies on the assumption that the data publisher has access to the same external table as the attacker, which may not be a realistic assumption in practice


1.4 Probabilistic linkage and Differential Privacy model

There exists a distinct category of privacy models that diverge from the precise identification of records, attributes, and tables that an attacker can link to a specific target individual. Instead, these models focus on the alteration of an attacker's probabilistic beliefs concerning the sensitive information of a victim following access to published data. The overarching objective of this family of privacy models is to uphold the uninformative principle, which strives to minimize the discrepancy between prior and posterior beliefs.

In 2006, Cynthia Dwork (adapted from [7], [12], [16], [22]) introduced a compelling concept known as Differential Privacy, emphasizing the protection of individuals' privacy in statistical databases. Dwork posited that the participation of record owners in such databases should not significantly heighten the risk to their privacy. Rather than comparing the prior and posterior probabilities before and after accessing the published data, Dwork proposed comparing the risk with and without the inclusion of the record owner's data in the public dataset. Consequently, Dwork formulated a privacy model termed differential privacy, aimed at ensuring that the addition or removal of a single database record has an inconsequential impact on any analysis outcomes. This implies that the merging of separate databases does not expose the organization to additional risks.

Differential privacy does not specifically prevent record and attribute linkages, which were discussed in earlier sections. However, it provides record owners with the assurance that submitting their personal information to the database is secure, as the outcome of the anonymization method does not significantly differ whether or not the record owner discloses their true information, which aligns with the uninformative principle mentioned above. Importantly, differential privacy furnishes record owners with the confidence that their personal information can be submitted to the database without revealing anything substantial, or anything at all, that could not have been discerned without their information. Dwork's work offers a formal demonstration of how differential privacy can guarantee protection against attackers possessing various background knowledge.


This robust guarantee is achieved by comparing the published data with and without the contribution of the record owner's data

2 Summary

The fundamental principles of data privacy protection and their widespread adoption can be acknowledged, although it is widely recognized that no single solution can fully guarantee privacy. The concept of k-Anonymity was developed to safeguard individuals against record linkage. However, in cases where k-Anonymity proves inadequate in preventing attribute linkage, the notion of l-Diversity was introduced, subsequently leading to the development of a more stringent variant known as t-Closeness. To address the risks associated with the presence of an individual's information in a dataset, the concept of δ-presence was formulated. Furthermore, Cynthia Dwork approached privacy concerns from both a prior and posterior perspective by devising Differential Privacy, a mathematical framework that incorporates measurable noise to model the privacy problem.

Based on extensive research ([11], [21], [25]), Differential Privacy offers remarkable advantages, as summarized below:

• Protection against various attack models: Unlike other privacy mechanisms, Differential Privacy does not require detailed modeling of specific attack models. Regardless of the attacker's goals or background knowledge, the mechanisms proven to satisfy Differential Privacy can still effectively protect privacy.

• Measurable privacy loss: While the addition of noise to data for privacy protection has been practiced since the 1970s, using techniques such as Statistical Disclosure Control, Differential Privacy provides a means to quantify the amount of noise necessary for a privacy solution while balancing it with data utility

• Composition of multiple mechanisms: With strong mathematical proofs, the "composition" technique in Differential Privacy enables the combination of multiple privacy algorithms, resulting in a more robust and reasonable solution.

Consequently, the forthcoming chapter will focus on the exploration and investigation of Differential Privacy, including its application in real-world scenarios


CHAPTER 3: THE INVESTIGATION ON DIFFERENTIAL PRIVACY

1 The need for Differential Privacy principle

As mentioned in the previous chapters, Differential Privacy could be a key solution to the limitations of k-Anonymity, l-Diversity, and their variants when dealing with multiple attack types, and it could improve performance because of its low computational complexity. We can explore some further good reasons to consider using DP here (adapted from [11], [21], [25]).

1.1 No need to model the attack model in detail

All previous privacy models required certain assumptions about the capabilities and objectives of the attacker. Selecting the appropriate notion of privacy necessitated understanding the extent of the attacker's prior knowledge, the auxiliary data they could utilize, and the specific information they aimed to uncover.

Implementing these definitions in practice proved challenging and prone to errors. Addressing these questions was a delicate task, as the attacker's intentions and capabilities might not be fully known or anticipated. Furthermore, there were potential unknown unknowns, representing unforeseen attack vectors that could undermine privacy. Consequently, making definitive statements based on these traditional definitions was difficult, as they relied on assumptions that could not be guaranteed with absolute certainty.

In contrast, the adoption of differential privacy provides two remarkable guarantees:

• Protection of all forms of individual information: Regardless of the attacker's objectives, whether it is re-identifying a target, determining their presence in the dataset, or inferring sensitive attributes, all such endeavors are safeguarded. Consequently, there is no need to explicitly consider the attacker's goals when applying differential privacy.

• Robustness against the attacker's knowledge of the data: Even if the attacker possesses knowledge about certain individuals in the database or attempts to introduce fake users, differential privacy remains effective. Privacy guarantees extend to individuals whom the attacker is unfamiliar with, ensuring their protection regardless of the attacker's level of knowledge.

By embracing differential privacy, organizations can confidently safeguard personal information across various scenarios, irrespective of the attacker's goals or knowledge about the data.

1.2 Quantifiable privacy loss

Differential privacy, unlike older privacy notions, introduces a numeric parameter that can be adjusted. However, the significance of this parameter differs significantly. Consider the case of k-anonymity, where the parameter k signifies that each record in the output dataset resembles at least k-1 other records. However, the value of k alone does not provide a clear indication of the level of privacy protection afforded.

The relationship between the value of k and the actual privacy of the dataset is tenuous at best. Selecting an appropriate value for k is often subjective and lacks a formal justification. This issue becomes even more pronounced with other conventional privacy definitions.

In contrast, differential privacy offers a more meaningful approach. It allows for the quantification of the maximum potential information gained by an attacker. The corresponding parameter, denoted as ε, enables formal statements to be made. For instance, if ε = 1.1, it can be stated that "an attacker who initially believes their target is present in the dataset with a probability of 50% can increase their level of certainty to no more than 75%." While determining the precise value of ε may pose challenges, it can be interpreted and reasoned about in a formal manner.
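For illustration (this arithmetic is not spelled out in the text), the 75% figure follows from applying the e^ε bound to the attacker's prior odds:

prior odds = 0.50 / 0.50 = 1
posterior odds ≤ e^ε × prior odds = e^1.1 × 1 ≈ 3.0
posterior probability ≤ 3.0 / (1 + 3.0) ≈ 0.75, i.e., no more than 75%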

Furthermore, the flexibility of differential privacy allows for various modifications of the statement mentioned above. It is possible to replace "their target is in the dataset" with any assertion regarding an individual, and the addition of "no matter what the attacker knows" can be incorporated to enhance precision. Collectively, these characteristics make differential privacy significantly stronger than all preceding definitions, particularly in terms of attack modeling and the ability to quantify privacy guarantees.

1.3 Multiple mechanisms composition

Consider a scenario where you possess a dataset that you intend to share with two trusted individuals, Alice and Bob, in an anonymized form. While you trust both Alice and Bob equally, they have distinct interests in the data, prompting you to provide them with different versions of the anonymized dataset, adhering to the same privacy definition.

However, a potential concern arises if Alice and Bob conspire and compare the data they received. In most privacy definitions, combining two k-anonymous versions of the same data does not preserve k-anonymity. Consequently, the union of their anonymized versions may no longer maintain the desired level of anonymity. This collaboration between Alice and Bob could potentially lead to the re-identification of individuals or even the complete reconstruction of the original data, which is a significant privacy risk.

Differential privacy, on the other hand, avoids this vulnerability. Suppose you provided differentially private data to both Alice and Bob, utilizing a privacy parameter of ε for each instance. If they conspire and combine their data, the resulting dataset remains protected by differential privacy. However, the level of privacy becomes weaker, with the privacy parameter becoming 2ε. While some information may still be gained through their collaboration, the extent of this information gain can be quantified. This property is known as composition.

Although this scenario may seem unlikely, the concept of composition is highly practical and valuable in real-world situations. Organizations often require multiple data applications, such as publishing statistics, releasing anonymized versions, and training machine learning algorithms. Composition enables the organization to maintain control over the level of risk as new use cases emerge and processes evolve, providing a robust framework for privacy management.

2 The promise (and not promised) of Differential Privacy

Adapted from [6]: Differential Privacy: A Primer for a Non-technical Audience (2017), by Professor Kobbi Nissim and his team.

2.1 The promise

The case: "Researchers selected a sample of individuals to participate in a survey exploring the relationship between socioeconomic status and medical outcomes across several U.S. cities. Individual respondents were asked to complete a questionnaire covering topics such as where they live, their finances, and their medical history. One of the participants, John, is aware that individuals have been re-identified in previous releases of de-identified data and is concerned that personal information he provides about himself, such as his HIV status or annual income, could one day be revealed in de-identified data released from this study. If leaked, the personal information John provides in response to the questionnaire used in this study could lead to an increase in his life insurance premium or an adverse decision on a mortgage application he submits in the future."

Analysis: Differential privacy aims to protect John's privacy in the real-world scenario in a way that mimics the privacy protection he is afforded in his opt-out scenario. Accordingly, what can be learned about John from a differentially private computation is (essentially) limited to what could be learned about him from everyone else's data without his own data being included in the computation. Crucially, this very same guarantee is made not only with respect to John, but also with respect to every other individual contributing his or her information to the analysis! A more precise description of the differential privacy guarantee requires the use of formal mathematical language, as well as technical concepts and reasoning that are beyond the scope of this document. Rather than providing a full, precise definition, this document offers a few illustrative examples to discuss various aspects of differential privacy in a way we hope is intuitive and accessible.

2.2 The not promise

The case: "Suppose Alice is a friend of John's and possesses some knowledge about him, such as that he regularly consumes several glasses of red wine with dinner. Alice later learns of a medical research study that found a positive correlation between drinking red wine and the occurrence of a certain type of cancer. She might therefore conclude, based on the results of this study and her prior knowledge of John's drinking habits, that he has a heightened risk of developing cancer."

Analysis: It may seem at first that the publication of the results from the medical research study enabled a privacy breach by Alice. After all, learning about the study's findings helped her infer new information about John, i.e., his elevated cancer risk. However, notice that Alice would be able to infer this information about John even if John had not participated in the medical study; it is a risk that exists in both John's opt-out scenario and the real-world scenario. In other words, this risk applies to everyone, regardless of whether they contribute personal data to the study or not.

2.3 Conclusion

While Differential Privacy is effective in safeguarding privacy when publishing datasets, it is crucial to acknowledge that certain information inference scenarios pose inherent risks of inference even in the absence of the published dataset. The scope of Differential Privacy does not encompass this realm. Differential Privacy primarily operates under the assumption that it assists in preventing an attacker from altering their posterior perspective after obtaining the dataset, regardless of whether they possess or lack access to the dataset, thereby maintaining their pre-existing knowledge and beliefs about the intended target.

3 Formal definition of Differential Privacy


3.1 Terms and notations

Knowledge adapted from [12], [20], [23], [25]

Data curator. A data curator manages the collected data throughout its life cycle. This can include data sanitization, annotation, publication, and presentation. The goal is to ensure that the data can be reliably reused and preserved. In DP, the data curator is responsible for ensuring that the privacy of no individual represented in the data can be violated.

Adversary. The adversary represents a data analyst who is interested in finding out (sensitive) information about the individuals in the dataset. In the context of DP, even the legitimate user of the database is referred to as an adversary, as his analyses can damage the individuals' privacy.

L1-norm. The L1-norm of a database D is denoted by ||D||1. It measures the size of D (e.g., the number of records it contains) and can be defined as:

||D||1 = Σ (i = 1 to |D|) |d_i|

Neighboring databases. Two databases D1 and D2 are called neighboring if they differ in at most ONE element. This can be expressed as:

||D1 − D2||1 ≤ 1
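As a small aside, the sketch below makes these two notions concrete by treating a database as a histogram (one count per possible record type), computing its L1-norm, and checking whether two databases are neighboring. The histogram view and the toy counts are assumptions made only for this example.

```python
import numpy as np

def l1_norm(db: np.ndarray) -> float:
    """L1-norm of a database in histogram form: the total number of records."""
    return float(np.sum(np.abs(db)))

def are_neighboring(db1: np.ndarray, db2: np.ndarray) -> bool:
    """Two databases are neighboring if they differ in at most one record."""
    return float(np.sum(np.abs(db1 - db2))) <= 1

d1 = np.array([3, 0, 5, 2])      # counts per record type
d2 = np.array([3, 0, 4, 2])      # one record removed from the third type

print(l1_norm(d1))               # 10.0 records in total
print(are_neighboring(d1, d2))   # True
```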

Mechanism. Differential Privacy itself is only an abstract concept (or principle). The realization of its mathematical guarantees is done by a concrete mechanism (or an algorithm) that releases statistical information about the dataset.


3.2 Randomized algorithm

Knowledge adapted from [12], [20], [23], [25]

Differential privacy is a property of some randomized algorithms. So, we will start by defining what a randomized algorithm is. We will mostly focus on randomized algorithms whose probabilistic space is discrete. To formalize this, we will use the notion of the probability simplex.

Definition 3.1 (Probability simplex). Given a discrete set B, the probability simplex over B, denoted Δ(B), is the set:

Δ(B) = { x ∈ R^|B| : x_i ≥ 0 for all i, and Σ (i = 1 to |B|) x_i = 1 }

3.3 𝜺-differential privacy

Knowledge adapted from [12], [20], [23], [25]


Definition 3.3. Let ε > 0. Define a randomized function M to be ε-differentially private if, for all neighboring input datasets D1 and D2 differing on at most one element, and ∀S ⊆ Range(M), we have:

Pr[M(D1) ∈ S] / Pr[M(D2) ∈ S] ≤ e^ε

where the probability is taken over the coin tosses of M.

The definition also implies a lower bound, since we can interchange D1 and D2 mutually:

e^(−ε) ≤ Pr[M(D1) ∈ S] / Pr[M(D2) ∈ S]

We can also obtain the constraint using the natural logarithm:

−ε ≤ ln( Pr[M(D1) ∈ S] / Pr[M(D2) ∈ S] ) ≤ ε

Working with the logarithm of probabilities has practical benefits:

• Accuracy: Using logarithmic probabilities improves numerical stability when probabilities are very small, because of the way in which computers approximate real numbers. With normal probabilities, more rounding errors are produced.

• Simplicity: Many probability distributions, especially the ones from which the random noise is drawn, have exponential form. Taking the logarithm of these distributions eliminates the exponential function, making it possible to calculate with the exponent only.
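To make the definition more tangible, the sketch below implements randomized response, a classic mechanism known to satisfy ε-DP with ε = ln 3 when fair coins are used, and empirically compares the output probabilities for the two possible values of a single individual's bit. It is included only as an illustration; randomized response is not one of the mechanisms evaluated later in the thesis.

```python
import math
import random

def randomized_response(true_bit: int) -> int:
    """Report the true bit with probability 3/4 and the opposite bit with probability 1/4."""
    if random.random() < 0.5:                   # first coin: tell the truth
        return true_bit
    return 1 if random.random() < 0.5 else 0    # second coin: answer at random

random.seed(0)
n = 200_000
p_yes_given_1 = sum(randomized_response(1) for _ in range(n)) / n
p_yes_given_0 = sum(randomized_response(0) for _ in range(n)) / n

epsilon = math.log(3)                           # privacy level of this mechanism
ratio = p_yes_given_1 / p_yes_given_0
print(f"empirical ratio = {ratio:.3f}, bound e^eps = {math.exp(epsilon):.3f}")
# The empirical ratio concentrates around e^eps = 3, the bound required by Definition 3.3.
```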

3.4 (𝜺, 𝜹) differential privacy

Trang 37

Knowledge adapted from [12], [20], [23], [25]

ε-DP places a high privacy requirement on mechanisms. But since adding too much noise to the original data can limit the amount of information drastically, several weakened versions of DP have been proposed. One of the most popular versions is (ε, δ)-Differential Privacy.

Definition 3.4 ((ε, δ)-differential privacy)

Define a randomized function M to be (ε, δ)-differentially private if, for all neighboring input datasets D1 and D2 differing on at most one element, and ∀S ⊆ Range(M), we have:

Pr[M(D1) ∈ S] ≤ e^ε · Pr[M(D2) ∈ S] + δ

In this case the two parameters ε and δ control the level of privacy.

The strongest version of differential privacy, in which δ = 0, is known as pure differential privacy, while the more general case where δ > 0 is known as approximate differential privacy. The meaning of δ can be seen as the probability of a data leak. Sometimes, this raises concerns regarding its presence in a privacy definition. However, there is no such thing as a silver bullet for all privacy problems, and everything in life needs risk management. Therefore, it is reasonable to consider δ as the real-life risk factor and have proper scenario planning for it.

This shows that δ should be small. Intuitively, that makes sense, because, with a large δ, even if ε = 0 (perfect secrecy), a mechanism that is (0, δ)-differentially private would breach privacy with high probability. A common heuristic for choosing δ for a database with n records is δ < 1/n. This is because, with an (ε, δ)-DP mechanism, the privacy of each record in the database is given away with probability δ. But note that (ε, δ)-DP is much weaker than (ε, 0)-DP, even when δ is very small in relation to the size n of the database.

4 Important concepts of Differential Privacy


4.1 The sensitivity

Knowledge adapted from [12], [20], [23], [25]

Sensitivity quantifies the extent to which the output of a function will change when its input is altered; this property facilitates the estimation of the required level of noise perturbation in the differential privacy mechanism.

4.1.1 Global sensitivity

Global sensitivity denotes the maximum difference between the query results on neighboring databases and is the quantity used in the differentially private mechanisms. The formal definition:

GS(f) = max over neighboring D1, D2 (||D1 − D2||1 ≤ 1) of ||f(D1) − f(D2)||1

Consider the COUNT function as an example, where GS(f) = 1. This is because the addition of a row to the dataset can, at most, increase the query's output by 1: either the new row possesses the desired attribute, leading to a count increment of 1, or it lacks the attribute, maintaining the same count (the count may correspondingly decrease when a row is removed).
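Global sensitivity is precisely the quantity used to calibrate the Laplace mechanism discussed later in this chapter: adding noise drawn from Laplace(GS(f)/ε) to the query answer yields ε-DP. The sketch below applies this to a COUNT query, whose global sensitivity is 1; it is a generic illustration rather than the thesis's LPA implementation.

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float,
                      rng: np.random.Generator) -> float:
    """Add Laplace noise with scale sensitivity/epsilon to a query answer (eps-DP)."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(42)
ages = np.array([23, 27, 31, 45, 52, 38])

# COUNT query: how many people are over 30? Its global sensitivity is 1.
true_count = int(np.sum(ages > 30))
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)

print(f"true count = {true_count}, noisy count = {noisy_count:.2f}")
```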

4.1.2 Local sensitivity

Thus far, we have exclusively examined one metric of sensitivity, namely global sensitivity. Our definition of global sensitivity considers any two neighboring datasets. However, this approach may be overly pessimistic, because when running our differential privacy mechanisms on a real dataset, it would be more relevant to consider the neighbors of that specific dataset.

This is where the concept of local sensitivity comes into play. Local sensitivity considers one dataset as the actual dataset being queried and considers all its neighboring datasets. Unlike global sensitivity, it is not possible to discuss the local sensitivity of a function without simultaneously considering the dataset at which that local sensitivity occurs.


Formally, the local sensitivity at a dataset D2 is defined as:

LS(f, D2) = max over D1 neighboring D2 (||D1 − D2||1 ≤ 1) of ||f(D1) − f(D2)||1

Local sensitivity offers a valuable means of establishing finite bounds on the sensitivity of certain functions, especially when it is challenging to bound their global sensitivity. The mean function serves as an illustrative example.

Differentially private means can be computed by dividing the query into two separate queries: a differentially private sum (the numerator) and a differentially private count (the denominator). Through sequential composition and post-processing (which will be covered later), the resulting quotient ensures differential privacy.

We use this approach because the extent to which the output of a mean query may change when a row is added to or removed from the dataset is contingent upon the dataset's size. If we were to bound the global sensitivity of a mean query, we would need to assume the worst-case scenario: a dataset of size 1. However, this assumption is overly pessimistic for large datasets, and thus the "noisy sum over noisy count" methodology is significantly more advantageous.
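A minimal sketch of the "noisy sum over noisy count" idea follows, assuming the values are clipped to a known range [0, b] so that the sum query has sensitivity b. By sequential composition the two sub-queries together cost ε_sum + ε_count, and dividing the two noisy answers is post-processing. The clipping bound and the even budget split are illustrative assumptions, not prescriptions from the thesis.

```python
import numpy as np

def dp_mean(values: np.ndarray, upper_bound: float, eps_sum: float, eps_count: float,
            rng: np.random.Generator) -> float:
    """Noisy sum over noisy count; the total privacy cost is eps_sum + eps_count."""
    clipped = np.clip(values, 0.0, upper_bound)           # bounds the sum's sensitivity
    noisy_sum = clipped.sum() + rng.laplace(0.0, upper_bound / eps_sum)
    noisy_count = len(values) + rng.laplace(0.0, 1.0 / eps_count)
    return noisy_sum / max(noisy_count, 1.0)              # post-processing, no extra cost

rng = np.random.default_rng(7)
ages = rng.integers(18, 90, size=1_000).astype(float)

estimate = dp_mean(ages, upper_bound=100.0, eps_sum=0.5, eps_count=0.5, rng=rng)
print(f"true mean = {ages.mean():.2f}, DP estimate = {estimate:.2f} (total eps = 1.0)")
```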

4.2 Privacy composition

Knowledge adapted from [12], [20], [23], [25]

Any approach to privacy should address the issues of composition. Composition is the execution of several queries on the same dataset. These queries can be independent, dependent, or even operate on each other's outputs. DP can capture composition within one and within several different DP mechanisms. Certainly, the parameters (ε, δ) degrade. There are two forms of composition: sequential and parallel composition.


Figure 2 Visualize how Sequential and Parallel Composition works – taken from [25]

4.2.1 Sequential composition

This helps bound the total privacy cost of releasing multiple results of differentially private mechanisms on the same input data. Formally, the sequential composition theorem for differential privacy states:

• If K1(x) satisfies ε1-differential privacy,
• and K2(x) satisfies ε2-differential privacy,
• then G(x) = (K1(x), K2(x)), which releases both results, satisfies (ε1 + ε2)-differential privacy.

Sequential composition is a vital property of differential privacy because it enables the design of algorithms that consult the data more than once. Sequential composition is also important when multiple separate analyses are performed on a single dataset, since it allows individuals to bound the total privacy cost they incur by participating in all these analyses. The bound on the privacy cost given by sequential composition is an upper bound: the actual privacy cost of two differentially private releases may be smaller than this, but never larger.
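One practical way to act on this property is to keep a running privacy budget and add up the ε of every release made on the same data. The sketch below shows such a minimal accountant; it covers plain sequential composition only and ignores tighter advanced-composition bounds.

```python
class PrivacyAccountant:
    """Tracks the total epsilon spent on one dataset under sequential composition."""

    def __init__(self, budget: float):
        self.budget = budget
        self.spent = 0.0

    def charge(self, epsilon: float) -> None:
        if self.spent + epsilon > self.budget:
            raise RuntimeError("privacy budget exceeded, refuse to answer the query")
        self.spent += epsilon

accountant = PrivacyAccountant(budget=1.0)
accountant.charge(0.5)   # e.g., a noisy sum released to Alice
accountant.charge(0.5)   # e.g., a noisy count released to Bob
print(f"total epsilon spent: {accountant.spent}")
# A third release of epsilon 0.5 would raise an error, because 1.5 > 1.0.
```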

4.2.2 Parallel composition

This can be seen as an alternative to sequential composition - a second way to calculate a bound on the total privacy cost of multiple data releases Parallel composition is based on the idea of splitting your dataset into disjoint chunks and
