Data Mining and Knowledge Discovery Handbook, 2nd Edition — Part 72
35 Privacy in Data Mining (Vicenç Torra)
35.2.1 Computation-Driven Protection Procedures: the Cryptographic Approach

As stated above, cryptographic protocols are often applied in applications where the analysis (or function) to be computed from the data is known. In fact, they are usually applied in scenarios with multiple data sources. We illustrate this scenario with an example.

Example 1. Parties P_1, ..., P_n own databases DB_1, ..., DB_n. The parties want to compute a function, say f, of these databases (i.e., f(DB_1, ..., DB_n)) without revealing unnecessary information. In other words, after computing f(DB_1, ..., DB_n) and delivering this result to all P_i, what P_i knows is nothing more than what can be deduced from its DB_i and the function f. So, the computation of f has not given P_i any extra knowledge.

Distributed privacy preserving data mining is based on secure multiparty computation, which was introduced by A. C. Yao in 1982 (Yao, 1982). For example, (Lindell and Pinkas, 2000) and (Lindell and Pinkas, 2002) defined a method based on cryptographic tools for computing a decision tree from two data sets owned by two different parties. (Bunn and Ostrovsky, 2007) discusses clustering data from different parties.

When data is represented in terms of records and attributes, two typical scenarios are considered in the literature: vertical partitioning of the data and horizontal partitioning. They are as follows.

• Vertically partitioned data. All data owners share the same records, but different data owners have information about different attributes (i.e., different data owners have different views of the same records or individuals).
• Horizontally partitioned data. All data owners have information about the same attributes; nevertheless, the records or individuals included in their databases are different.

As stated above, for both centralized and distributed PPDM the only information that a data owner should learn is what can be inferred from his original data and the final computed analysis. In this setting, the centralized approach is considered as a reference result when analyzing the privacy of the distributed approach. Privacy leakage for the distributed approach is usually analyzed considering two types of adversaries.

• Semi-honest adversaries. Data owners follow the cryptographic protocol, but they analyse all the information they get during its execution to discover as much information as they can.
• Malicious adversaries. Data owners try to fool the protocol (e.g., aborting it or sending incorrect messages on purpose) so that they can infer confidential information.

Computation-driven protection procedures using cryptographic approaches present some clear advantages with respect to general purpose ones. The first one is the good quality of the computed function (analysis): the function we compute is exactly the one the users want to compute. This is not so, as we will see later, when other general purpose protection methods are used; in that case, the resulting function is just an approximation of the function we would compute from the original data. At the same time, cryptographic tools ensure an optimal level of privacy.

Nevertheless, this approach has some limitations. The first one is that we need to know beforehand the function (or analysis) to be computed. As different functions lead to different cryptographic protocols, any change in the function to be computed (even a small one) requires a redefinition of the protocol.
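To make the flavor of these protocols concrete, the following minimal sketch shows a classical secure-sum protocol for horizontally partitioned data in the semi-honest model: every party learns the global sum and nothing about the other parties' individual inputs (unless parties collude). The sketch is illustrative only; the names Party and secure_sum are ours, not from the literature cited above. Note also how the protocol is tied to one specific function (the sum): computing a different analysis would require a different protocol.

    import random

    MODULUS = 2**64  # all arithmetic is done modulo a large constant

    class Party:
        def __init__(self, local_value):
            self.local_value = local_value  # e.g., a count computed from DB_i

        def add_share(self, running_total):
            # A party only ever sees the masked running total, never a raw input.
            return (running_total + self.local_value) % MODULUS

    def secure_sum(parties):
        initiator = parties[0]
        mask = random.randrange(MODULUS)   # random mask hides the initiator's value
        total = initiator.add_share(mask)  # the chain starts with the masked value
        for p in parties[1:]:
            total = p.add_share(total)     # each party adds its own value
        return (total - mask) % MODULUS    # the initiator removes the mask

    print(secure_sum([Party(10), Party(25), Party(7)]))  # prints 42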
A second disadvantage is that the computational costs of the protocols are very high. In addition, the problem becomes even harder when malicious adversaries are considered.

(Kantarcioglu, 2008) discusses other limitations. One is that most of the literature only considers the types of adversaries described above (honest, semi-honest and malicious); no other types are studied. Another one is the fact that in these methods no trade-off can be found between privacy and information loss (they use the term accuracy). As we will see later in Sections 35.5 and 35.6, most general purpose protection procedures permit the user to select an appropriate trade-off between these two contradictory issues. When using cryptographic protocols, the only trade-off that can be implemented easily is the one between privacy and efficiency.

35.2.2 Data-driven Protection Procedures

Given a data set, data-driven protection procedures construct a new data set so that the new one does not permit a third party to infer confidential information present in the original data. Different methods have been developed for this purpose. We will focus on the case where the data set is a standard file defined in terms of records and attributes (microdata, following the jargon of statistical disclosure control). As stated above, we can also consider other types of data sets, e.g. aggregate data (tabular data, following the jargon of SDC).

All data-driven methods are similar in the sense that they construct the new data set by reducing the quality of the original one. As this quality reduction might make the data unsuitable for a particular analysis, measures have been developed to evaluate to what extent the protected data set is still valid. These measures are known as information loss measures or utility measures.

Data-driven procedures are much more efficient with respect to computational cost than the ones using cryptographic protocols. Nevertheless, this greater efficiency comes at the cost of not ensuring complete privacy. Due to this, measures of risk have been developed to determine to what extent a protected data set ensures privacy. These measures are known as disclosure risk measures.

These two families of measures, information loss and disclosure risk, are in contradiction and, thus, methods should look for an appropriate trade-off between risk and utility. Tools have been developed to visualize and to quantify this trade-off, so that protection methods can be compared. We will present some of the data protection procedures in Section 35.4, information loss measures in Section 35.5, and visualization methods in Section 35.6. Section 35.3 includes a description of the standard scenario for evaluating risk before reviewing disclosure risk measures.

35.3 Disclosure Risk Measures

Disclosure risk is defined in terms of the additional confidential information (in general, additional knowledge) that an intruder can acquire from the protected data set. According to (Lambert, 1993, Paass, 1985), disclosure risk can be studied from two perspectives:

• Identity disclosure. This disclosure takes place when a respondent is linked to a particular record in the protected data set. This process of linking is known as re-identification (of the respondent).
• Attribute disclosure. In this case, defining disclosure as the disclosure of the identity of the individual is considered too strong.
Disclosure takes place when the intruder can learn something new about an attribute of a respondent, even when no relationship can be established between the individual and the data. That is, disclosure takes place when the published data set permits the intruder to increase his accuracy on an attribute of the respondent. This approach was first formulated in (Dalenius, 1977) (see also (Duncan and Lambert, 1986) and (Duncan and Lambert, 1989)).

Interval disclosure is a measure for attribute disclosure, proposed in (Domingo-Ferrer et al., 2001) and (Domingo-Ferrer and Torra, 2001b). It is defined according to the following procedure. Each attribute is independently ranked and a rank interval is defined around the value the attribute takes on each record. The ranks of values within the interval for an attribute around record r should differ less than p percent of the total number of records, and the rank in the center of the interval should correspond to the value of the attribute in record r. Then, the proportion of original values that fall into the interval centered around their corresponding protected value is a measure of disclosure risk. A 100 percent proportion means that an attacker is completely sure that the original value lies in the interval around the protected value (interval disclosure).
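The procedure above can be turned into a short computation. The following is a rough sketch for numerical attributes, under our own reading of the procedure (rank intervals of half-width p·n around each protected value) and assuming data stored as one list of values per attribute; the function name and data layout are illustrative, not from the original papers.

    import bisect

    def interval_disclosure(original, protected, p=0.1):
        """Average proportion of original values that fall inside a rank interval
        of half-width p*n centered on the corresponding protected value."""
        n = len(original[0])
        w = max(1, int(p * n))            # half-width of the interval, in ranks
        hits, total = 0, 0
        for orig_attr, prot_attr in zip(original, protected):
            ranked = sorted(prot_attr)    # each attribute is ranked independently
            for o, pr in zip(orig_attr, prot_attr):
                r = bisect.bisect_left(ranked, pr)   # rank of the protected value
                lo = ranked[max(0, r - w)]           # value at the lower rank
                hi = ranked[min(n - 1, r + w)]       # value at the upper rank
                total += 1
                if lo <= o <= hi:         # original value lies inside the interval
                    hits += 1
        return hits / total               # 1.0 means certain attribute disclosure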
Identity disclosure has received much attention in recent years and has been used to evaluate different protection methods. Its formulation needs a concrete scenario. We present it in the next section. Some identity disclosure risk measures will be reviewed later using this scenario.

35.3.1 A Scenario for Identity Disclosure

The typical scenario is to consider the protected data set and an intruder having some partial information about the individuals in the published data set. The protected data set is assumed to be a data file, and it is usual to consider that the intruder's information can be represented in the same way. See e.g. (Sweeney, 2002, Torra et al., 2006).

Formally, we consider data sets X with the usual structure of r rows (records) and k columns (attributes). Naturally, each row contains the values of the attributes for an individual. Then, the attributes in X can be classified (Dalenius, 1986, Samarati, 2001, Torra et al., 2006) in three non-disjoint categories.

• Identifiers. These are attributes that unambiguously identify the respondent. Examples are passport number, social security number, full name, etc.
• Quasi-identifiers. These are attributes that, in combination, can be linked with external information to re-identify some of the respondents. Examples are age, birth date, gender, job, zipcode, etc. Although a single such attribute cannot identify an individual, a subset of them can.
• Confidential. These are attributes which contain sensitive information on the respondent. For example, salary, religion, political affiliation, health condition, etc.

Using these three categories, an original data set X is defined as X = id || X_nc || X_c, where id are the identifiers, X_nc are the non-confidential quasi-identifier attributes, and X_c are the confidential attributes.

Let us consider the protected data set X'. X' is obtained from the application of a protection procedure to X. This process takes into account the type of the attributes. It is usual to proceed as follows.

• Identifiers. To avoid disclosure, identifiers are usually removed or encrypted in a preprocessing step. In this way, information cannot be linked to specific respondents.
• Confidential. These attributes X_c are usually not modified. So, we have X'_c = X_c.
• Quasi-identifiers. They cannot be removed, as almost all attributes can be quasi-identifiers. The usual approach to preserve the privacy of the individuals is to apply protection procedures to these attributes. We will use ρ to denote the protection procedure. Therefore, we have X'_nc = ρ(X_nc).

Therefore, we have X' = ρ(X_nc) || X_c. Proceeding in this way, we allow third parties to have precise information on the confidential data without revealing to whom the confidential data belongs.

In this scenario we have identity disclosure when an intruder, having some information described in terms of a set of records and some quasi-identifiers, can link his information with the published data set. That is, he is able to link his records with the ones in the protected data set. Then, if the links between records are correct, he will be able to obtain the right values for the confidential attributes. Figure 35.1 represents this situation: A represents the file with data from the protected data set (i.e., containing records from X') and B represents the file with the records of the intruder. B is usually defined in terms of the original data set X, because it is assumed that the intruder has a subset of X. In general, the number of records owned by the intruder and the number of records in the protected data file will differ. Re-identification is achieved using some quasi-identifiers common to both X and X'. They permit pairs of records from both files to be linked (using record linkage algorithms), and then the confidential attributes are linked to the identifiers. At this point re-identification is achieved.

[Fig. 35.1. Disclosure Risk Scenario: file A (protected, public), with quasi-identifiers and confidential attributes, is linked through record linkage on the shared quasi-identifiers a_1, ..., a_n to file B (intruder), which also contains identifiers i_1, i_2, ...]

Formally, following (Torra et al., 2006, Nin et al., 2007, Sweeney, 2002) and the notation in Figure 35.1, the intruder is assumed to know the non-confidential quasi-identifiers X_nc = {a_1, ..., a_n} together with the identifiers Id = {i_1, i_2, ...}. Then, the linkage is between the quasi-identifiers (a_1, ..., a_n) from the protected data (X'_nc) and the same attributes from the intruder (X_nc).

35.3.2 Measures for Identity Disclosure

Two main approaches exist for measuring identity disclosure risk. They are known as uniqueness and re-identification. We describe them below.

• Re-identification. Risk is defined as an estimation of the number of re-identifications that might be obtained by an intruder. This estimation is obtained empirically through record linkage algorithms. This approach for measuring disclosure risk goes back, at least, to (Spruill, 1983) and (Paass, 1985) (using e.g. the algorithm described in (Paass and Wauschkuhn, 1985)). (Torra et al., 2006, Nin et al., 2007, Sweeney, 2002) are more recent papers using this approach. This approach is general enough to be applied in different contexts: it can be applied under different assumptions about the intruder's knowledge and about the protection procedure, and even when protected data has been generated using a synthetic data generator (i.e., data constructed using a particular data model; see Section 35.4.3 for details). For example, (Torra et al., 2006) describes empirical results of using record linkage algorithms on synthetic data, discussing the performance of different algorithms. (Winkler, 2004) considers a similar problem.
• Uniqueness.
Informally, the risk of identity disclosure is measured as the probability that rare combinations of attribute values in the protected data set are indeed rare in the original population. This approach is typically used when data is protected using sampling (Willenborg, 2001) (i.e., X' is just a subset of X). Note that with perturbative methods it makes no sense to investigate the probability that a rare combination of protected values is rare in the original data set, because that combination is most probably not found in the original data set.

In the next sections we describe these two approaches in more detail.

Uniqueness

Two types of disclosure risk measures based on uniqueness can be distinguished: file-level and record-level. We describe them below.

• File-level uniqueness. Disclosure risk is defined as the probability that a sample unique (SU) is a population unique (PU) (Elliot et al., 1998). According to (Elamir, 2004), this probability can be computed as

    P(PU | SU) = P(PU, SU) / P(SU) = Σ_j I(F_j = 1, f_j = 1) / Σ_j I(f_j = 1)

where j = 1, ..., J denotes the possible key values in the sample, F_j is the number of individuals in the population with key value j (the frequency of j in the population), f_j is the same frequency for the sample, and I stands for the cardinality of the selection. Unless the sample size is much smaller than the population size, P(PU | SU) can be dangerously high; in that case, an intruder who locates a unique value in the released sample can be almost certain that there is a single individual in the population with that value, which is very likely to lead to that individual's identification. A small computational sketch of this estimate follows this list.
• Record-level uniqueness. These measures are also known as individual risk measures. Disclosure risk is defined as the probability that a particular sample record is re-identified, i.e., recognized as corresponding to a particular individual in the population. As (Elliot, 2002) points out, the main rationale behind this approach is that risk is not homogeneous within a data file. We summarize next the description given in (Franconi and Polettini, 2004) of the record-level risk estimation. Assume that there are K possible combinations of key attributes. These combinations induce a partition both in the population and in the released sample. If the frequency of the k-th combination in the population were known to be F_k, then the individual disclosure risk of a record in the sample with the k-th combination of key attributes would be 1/F_k. Since the population frequencies F_k are generally unknown but the sample frequencies f_k of the combinations are known, the distribution of F_k given f_k is considered. Under reasonable assumptions, the distribution of F_k | f_k can be modeled as a negative binomial. The per-record risk of disclosure is then measured as the posterior mean of 1/F_k with respect to the distribution of F_k | f_k.
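The file-level estimate above can be computed directly when the population key frequencies F_j are available (e.g., in an evaluation setting where the data protector owns the full population file; in practice they are usually unknown, which motivates the record-level modeling approach). The sketch below assumes each record's key values are encoded as a single hashable tuple; names and data are illustrative.

    from collections import Counter

    def p_pu_given_su(population_keys, sample_keys):
        """Estimate P(PU|SU): the probability that a sample unique is also a
        population unique. Keys are hashable combinations of key attributes."""
        F = Counter(population_keys)    # F_j: population frequency of key value j
        f = Counter(sample_keys)        # f_j: sample frequency of key value j
        sample_uniques = [j for j, c in f.items() if c == 1]
        if not sample_uniques:
            return 0.0
        both = sum(1 for j in sample_uniques if F[j] == 1)  # F_j = 1 and f_j = 1
        return both / len(sample_uniques)                   # = P(PU,SU) / P(SU)

    # Illustrative example: key = (age group, zipcode)
    population = [("30-40", "08001")] * 3 + [("20-30", "08002"), ("50-60", "08003")]
    sample = [("30-40", "08001"), ("20-30", "08002"), ("50-60", "08003")]
    print(p_pu_given_su(population, sample))  # 2 of the 3 sample uniques are PUs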
Record Linkage

This approach for measuring disclosure risk directly follows the scenario in Figure 35.1. That is, record linkage consists of linking each record b of the intruder (file B) to a record a in the original file A. The pair (a, b) is a match if b turns out to be the original record corresponding to a. For applying record linkage, the common approach is to use the shared attributes (some quasi-identifiers). As the number of matches is an estimation of the number of re-identifications that an intruder can achieve, disclosure risk is defined as the proportion of matches among the total number of records in B.

Two main types of record linkage algorithms are described in the literature: distance-based and probabilistic. They are outlined below (a minimal sketch of the distance-based approach is given at the end of this section); for details on these methods see (Torra and Domingo-Ferrer, 2003).

• Distance-based record linkage. Each record b in B is linked to its nearest record a in A. An appropriate definition of a record-level distance has to be supplied to the algorithm to express nearness. This distance is usually constructed from distance functions defined at the level of attributes. In addition, we need to standardize attributes as well as assign weights to them. (Pagliuca et al., 1999) proposed distance-based record linkage to assess the disclosure risk of microaggregation, using the Euclidean distance and equal weights for all attributes. Later, in (Domingo-Ferrer and Torra, 2001b), distance-based record linkage (also with Euclidean distance and equal weights) was used for evaluating other masking methods as well; in their empirical work, distance-based record linkage outperforms probabilistic record linkage (see below). The main advantages of using distances for record linkage are simplicity for the implementer and intuitiveness for the user. Another strong point is that subjective information (about individuals or attributes) can be included in the re-identification process by means of appropriate distances. The main difficulties of distance-based record linkage are (i) the selection of an appropriate distance function and (ii) the determination of the weights. Regarding the distance function, for numerical data the Euclidean distance is the most used; nevertheless, other distances have also been used, e.g. the Mahalanobis distance (Torra et al., 2006) and some kernel-based ones (Torra et al., 2006). The difficulty of choosing a distance is especially thorny in the case of categorical attributes and of masking methods such as local recoding, where the masked file contains new labels with respect to the original data set. The determination of the weights is also a relevant problem that is difficult to solve. In the case of the Euclidean distance, it is common to assign equal weights to all attributes; in the case of the Mahalanobis distance, this problem is avoided because the weights are extracted from the covariance matrix.
• Probabilistic record linkage. Probabilistic record linkage also links pairs of records (a, b) in data sets A and B, respectively. For each pair, an index is computed. Then, two thresholds LT and NLT in the index range are used to label the pair as a linked, clerical or non-linked pair: if the index is above LT, the pair is linked; if it is below NLT, the pair is non-linked; a clerical pair is one that cannot be automatically classified as linked or non-linked. When independence between attributes is assumed, the index can be computed from the following two conditional probabilities for each attribute: the probability P(1|M) of coincidence between the values of the attribute in two records a and b given that these records are a real match, and the probability P(0|U) of non-coincidence between the values of the attribute given that a and b are a real unmatch. To use probabilistic record linkage in an effective way, we need to set the thresholds LT and NLT and estimate the conditional probabilities P(1|M) and P(0|U) used in the computation of the indices. In plain words, the thresholds are computed from (i) the probability P(LP|U) of linking a pair that is an unmatched pair (a false positive or false linkage) and (ii) the probability P(NP|M) of not linking a pair that is a match (a false negative or false unlinkage). The conditional probabilities P(1|M) and P(0|U) are usually estimated using the EM algorithm (Dempster et al., 1977). The original description of probabilistic record linkage can be found in (Fellegi and Sunter, 1969) and (Jaro, 1989). (Torra and Domingo-Ferrer, 2003) describe the method in detail (with examples) and (Winkler, 1993) presents a review of the state of the art on probabilistic record linkage; in particular, this latter paper includes a discussion concerning non-independent attributes. A (hierarchical) graphical model has been proposed (Ravikumar and Cohen, 2004) that compares favorably with previous approaches. Probabilistic record linkage methods are less simple than distance-based ones. However, they do not require rescaling or weighting of attributes; the user only needs to provide the two probabilities P(LP|U) (false positives) and P(NP|M) (false negatives).

The literature presents some other record linkage algorithms, some of which are variations of the ones presented here. For example, (Bacher et al., 2002) presents a method based on cluster analysis. Its results are similar to those of distance-based record linkage, as cluster analysis assigns objects (in this case, records) that are similar (in this case, near) to each other to the same cluster. The algorithms presented here permit two records b_1 and b_2 of the intruder to be assigned to the same record a; there are algorithms that force different records in B to be linked to different records in A.

The approaches described so far do not use any information about the data protection process. That is, they use files A and B and try to re-identify as many records as possible. In this sense, they are general purpose record linkage algorithms. In recent years, specific record linkage algorithms have been developed. They take advantage of any information available about the data protection procedure: protection procedures are analyzed in detail to find flaws that can be exploited to build more efficient record linkage algorithms, with larger matching rates. Attacks tailored to two protection procedures are reported in the literature. (Torra and Miyamoto, 2004) was the first specific record linkage approach for microaggregation; more effective algorithms have been proposed in (Nin and Torra, 2009, Nin et al., 2008b) (for both univariate and multivariate microaggregation). (Nin et al., 2007) describes an algorithm for data protected using rank swapping.

The scenario described above can be relaxed so that the published file and the one of the intruder do not share the set of variables, i.e., there are no common quasi-identifiers in the two files. A few record linkage algorithms have been developed under this premise. In this case, some structural information is assumed to be common to both files. (Torra, 2004) follows this approach; its use for disclosure risk assessment is described in (Domingo-Ferrer and Torra, 2003).
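As announced above, the following minimal sketch illustrates the distance-based approach with Euclidean distance and equal weights, as in (Pagliuca et al., 1999). It assumes numerical quasi-identifiers that have already been standardized, and a known ground truth for evaluating the risk; the function and variable names are our own.

    def distance_based_linkage(A, B):
        """Link each intruder record b in B to its nearest record a in A.
        Records are tuples of standardized numerical quasi-identifiers."""
        def dist2(a, b):                # squared Euclidean distance, equal weights
            return sum((x - y) ** 2 for x, y in zip(a, b))
        links = []
        for j, b in enumerate(B):
            i = min(range(len(A)), key=lambda i: dist2(A[i], b))
            links.append((i, j))        # record j in B is linked to record i in A
        return links

    def disclosure_risk(links, true_match):
        """Proportion of correct links (matches) over the records in B;
        true_match[j] is the index in A of the record corresponding to b_j."""
        correct = sum(1 for i, j in links if true_match[j] == i)
        return correct / len(links)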
35.4 Data Protection Procedures

Protection methods can be classified into three different categories depending on how they manipulate the original data to define the protected data set.

• Perturbative. The original data set is distorted in some way, so that the new data set might contain some erroneous information. E.g., noise is added to an attribute following an N(0, a) distribution for a given a (a minimal sketch of this idea is given below, after this classification). In this way, some combinations of values disappear and new combinations appear in the protected data set. At the same time, combinations in the protected data set no longer correspond to the ones in the original data set. This obfuscation makes disclosure difficult for intruders.
• Non-perturbative. Protection is achieved by replacing an original value by another one that is not incorrect but less specific. For example, we replace a real number by an interval. In general, non-perturbative methods reduce the level of detail of the data set. This detail reduction causes different records to have the same combinations of values, which makes disclosure difficult for intruders.
• Synthetic data generators. In this case, instead of distorting the original data, new artificial data is generated and used to substitute the original values. Formally, synthetic data generators build a data model from the original data set and, subsequently, a new (protected) data set is randomly generated, constrained by the computed model.

An alternative dimension for classifying protection methods is the type of data. The basic distinction is between numerical and categorical data, although other types of data (e.g. time series (Nin and Torra, 2006), sequences of events for location privacy, logs, etc.) have also been considered in the literature.

• Numerical. As usual, an attribute is numerical when arithmetic operations (e.g. subtraction) can be performed with it. Income and age are typical examples of such attributes. With respect to disclosure risk, numerical values are likely to be unique in a database and, therefore, lead to disclosure if no action is taken.
• Categorical. In this case, the attribute takes values over a finite set and standard numerical operations do not make sense. Ordinal and nominal scales are typically distinguished among categorical attributes. In ordinal scales the order between values is relevant (e.g. academic degree), whereas in nominal scales it is not (e.g. hair color). Therefore, max and min operations are meaningful on ordinal scales but not on nominal scales. Structured attributes are a subclass of categorical attributes in which the different categories are related in terms of subclass or membership relationships. In some cases, a hierarchy between categories can be inferred from these relationships. Cities, counties, and provinces are typical examples of such hierarchical attributes. For some attributes the hierarchy is given; for others it is not given but is constructed by the protection procedure.

In the next sections we review some of the existing protection methods following the classification above. Some good reviews on data protection procedures are (Adam and Wortmann, 1989, Domingo-Ferrer and Torra, 2001a, Willenborg, 2001). In addition, we include a section about k-anonymity. As we will see later, k-anonymity is not a protection method but a general approach for avoiding disclosure up to a certain extent; different instantiations exist, some using perturbative and some using non-perturbative procedures.
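As a minimal illustration of the perturbative category, the sketch below adds N(0, a) noise to a numerical attribute, as in the first bullet above, taking a as the standard deviation of the noise; the attribute values and the choice of a are purely illustrative.

    import random

    def add_noise(values, a):
        """Mask a numerical attribute by adding N(0, a) noise to every value
        (a is used here as the standard deviation of the noise)."""
        return [v + random.gauss(0.0, a) for v in values]

    incomes = [21000.0, 35500.0, 87200.0]       # illustrative data
    protected = add_noise(incomes, a=1000.0)    # larger a: lower risk, higher loss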
In this section we will use X to denote the original data set, X' to denote the protected data set, and x_{i,V} to represent the value of attribute V in the i-th record.

35.4.1 Perturbative Methods

In this section we review some of the perturbative methods. Among them, the ones most used by statistical agencies are rank swapping and microaggregation (Felso et al., 2001), but the literature on privacy preserving data mining, more oriented to business-related applications, largely focuses on additive noise and microaggregation. Microaggregation and rank swapping are simple and have a low computational cost. Most of the methods described in this section, with some of their variants, are implemented in the sdcMicro package in R (Templ, 2008) and in the μ-Argus software (Hundepool et al., 2003).

Rank Swapping

Rank swapping was originally proposed for ordinal attributes in (Moore, 1996), but it has also been applied to numerical data in (Domingo-Ferrer and Torra, 2001b). It was classified in (Domingo-Ferrer and Torra, 2001b) among the best microdata protection methods for numerical attributes and in (Torra, 2004) among the best for categorical attributes.

Rank swapping is defined for a single attribute V as described below; the application of this method to a data file with several attributes is done attribute-wise, in a sequential way. The algorithm depends on a parameter p that permits the user to control the amount of disclosure risk. Normally, p corresponds to a percentage of the total number of records in X. A minimal implementation sketch is given at the end of this subsection.

• The records of X (for the considered attribute V) are sorted in increasing order. Let us assume, for simplicity, that the records are already sorted and that (a_1, ..., a_n) are the sorted values in X. That is, a_i ≤ a_l for all 1 ≤ i < l ≤ n.
• Each value a_i is swapped with another value a_l, randomly and uniformly chosen from the limited range i < l ≤ i + p.
• The sorting step is undone.

The algorithm shows that the smaller the p, the larger the risk: when p increases, the difference between a value and the value it is swapped with may increase accordingly, and therefore the risk decreases. Nevertheless, in this case the differences between the original and the protected data set are larger, so information loss increases.

(Nin et al., 2007) proves that specific attacks can be designed against this kind of rank swapping and proposes two alternative algorithms where the swapping is not constrained to a specific interval. In these algorithms the range for swapping includes the whole set (a_1, ..., a_n), although values farther away have a small probability of being swapped; in this way, the intruder cannot take advantage of the closed intervals in an attack. p-buckets and p-distribution rank swapping are the names of these algorithms. Other variants of rank swapping include (Carlson and Salabasis, 2002) and (Takemura, 2002).
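The following sketch implements the basic algorithm above for one numerical attribute. It is a simplified illustration (each value is swapped at most once, and p is given as a fraction of the number of records rather than a percentage); it is not claimed to be the exact procedure of (Moore, 1996).

    import random

    def rank_swap(values, p=0.05):
        """Rank swapping for a single numerical attribute; p is the swap window
        expressed as a fraction of the number of records (simplified sketch)."""
        n = len(values)
        w = max(1, int(p * n))                             # swap window, in ranks
        order = sorted(range(n), key=lambda i: values[i])  # rank -> record index
        out = list(values)
        swapped = [False] * n
        for r in range(n):
            if swapped[r]:
                continue
            # choose an unswapped partner uniformly within the next w ranks
            candidates = [s for s in range(r + 1, min(r + w, n - 1) + 1)
                          if not swapped[s]]
            if not candidates:
                swapped[r] = True
                continue
            s = random.choice(candidates)
            i, j = order[r], order[s]
            out[i], out[j] = out[j], out[i]   # swap in the original record order,
            swapped[r] = swapped[s] = True    # which implicitly undoes the sorting
        return out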
Microaggregation

Microaggregation was originally defined for numerical attributes (Defays and Nanopoulos, 1993) (see also (Domingo and Mateo, 2002)) and later extended to categorical data (Torra, 2004) (see also (Domingo-Ferrer and Torra, 2005)) and to time series (Nin and Torra, 2006). (Felso et al., 2001) shows that microaggregation is a method used by many statistical agencies, and (Domingo-Ferrer and Torra, 2001b) shows that, for numerical data, it is one of the methods with a better trade-off between information loss and disclosure risk. (Torra, 2004) describes its good performance in comparison with other methods for categorical data.

Microaggregation is operationally defined in terms of two steps: partition and aggregation.

• Partition. Records are partitioned into several clusters, each of them consisting of at least k records.
• Aggregation. For each of the clusters a representative (the centroid) is computed, and then the original records are replaced by the representative of the cluster to which they belong.

This approach permits the protected data to satisfy privacy constraints, as all k records in a cluster are replaced by the same value. In this way, k controls the privacy in the protected data.

We can formalize microaggregation using u_{ij} to describe the partition of the records in X: u_{ij} = 1 if record j is assigned to the i-th cluster. Let v_i be the representative of the i-th cluster; then a general formulation of microaggregation with g clusters and a given k is as follows:

    Minimize    SSE = Σ_{i=1}^{g} Σ_{j=1}^{n} u_{ij} (d(x_j, v_i))^2
    Subject to  Σ_{i=1}^{g} u_{ij} = 1        for all j = 1, ..., n
                2k ≥ Σ_{j=1}^{n} u_{ij} ≥ k   for all i = 1, ..., g
                u_{ij} ∈ {0, 1}

For numerical data it is usual to require that d(x, v) is the Euclidean distance. In the general case, when attributes V = (V_1, ..., V_s) are considered, x and v are vectors and d becomes d^2(x, v) = Σ_{V_i ∈ V} (x_{V_i} − v_{V_i})^2. In addition, it is also common to require for numerical data that v_i is defined as the arithmetic mean of the records in the cluster, i.e., v_i = Σ_{j=1}^{n} u_{ij} x_j / Σ_{j=1}^{n} u_{ij}.

In the case of univariate microaggregation (for the Euclidean distance and the arithmetic mean), there exist algorithms that find an optimal solution in polynomial time (Hansen and Mukherjee, 2003) (Algorithm 1 describes such a method). In contrast, for multivariate data sets the problem becomes NP-hard (Oganian and Domingo-Ferrer, 2000); for this reason, heuristic methods have been proposed in the literature. A simple heuristic sketch for the univariate case is given at the end of this subsection.

The general formulation given above permits us to apply microaggregation to multidimensional data. Nevertheless, when the number of attributes is large, it is usual to apply microaggregation to subsets of attributes; otherwise, the information loss is very high (Aggarwal, 2005). Individual ranking is a multivariate approach that consists of applying microaggregation to each of the attributes in an independent way. Alternatively, a partition of the attributes is constructed and microaggregation is applied to each subset. Applying microaggregation to subsets of attributes decreases information loss, but at the cost of increasing disclosure risk. See (Nin et al., 2008a) for an analysis of how to build these partitions (i.e., whether it is preferable to select correlated or uncorrelated attributes when defining the partition) and their effect on information loss and disclosure risk; it shows that selecting uncorrelated attributes decreases disclosure risk and can lead to a better trade-off between disclosure risk and information loss.
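To illustrate the two steps (partition and aggregation), the sketch below implements a simple sort-based heuristic for univariate numerical microaggregation: it forms groups of k consecutive sorted values, letting the last group absorb the remainder so that cluster sizes stay between k and 2k − 1 (assuming n ≥ k), and replaces each value by its group mean. This is a heuristic illustration, not the optimal polynomial-time algorithm of (Hansen and Mukherjee, 2003).

    def microaggregate(values, k=3):
        """Univariate microaggregation heuristic: sort, group k consecutive
        values, and replace each value by its cluster centroid (the mean)."""
        n = len(values)
        order = sorted(range(n), key=lambda i: values[i])
        out = [0.0] * n
        start = 0
        while start < n:
            # if fewer than 2k values remain, put them all in one last group
            end = n if n - start < 2 * k else start + k
            group = order[start:end]
            centroid = sum(values[i] for i in group) / len(group)
            for i in group:
                out[i] = centroid   # every record in the cluster gets the centroid
            start = end
        return out

    print(microaggregate([4.0, 1.0, 9.0, 2.0, 8.0, 3.0, 7.0], k=3))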
