Data Mining and Knowledge Discovery Handbook, 2nd Edition, part 71

Chapter 34, Mining Multi-label Data
Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas

$$\delta(\lambda) = \begin{cases} 1 & \text{if } \lambda \notin Y_i \\ 0 & \text{otherwise} \end{cases}$$

Coverage evaluates how far we need, on average, to go down the ranked list of labels in order to cover all the relevant labels of the example.

$$\text{Cov} = \frac{1}{m} \sum_{i=1}^{m} \max_{\lambda \in Y_i} r_i(\lambda) - 1$$

Ranking loss expresses the number of times that irrelevant labels are ranked higher than relevant labels:

$$\text{R-Loss} = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{|Y_i|\,|\overline{Y_i}|} \left|\left\{(\lambda_a, \lambda_b) : r_i(\lambda_a) > r_i(\lambda_b),\ (\lambda_a, \lambda_b) \in Y_i \times \overline{Y_i}\right\}\right|$$

where $\overline{Y_i}$ is the complementary set of $Y_i$ with respect to $L$.

Average precision evaluates the average fraction of labels ranked above a particular label $\lambda \in Y_i$ which actually are in $Y_i$.

$$\text{AvgPrec} = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{|Y_i|} \sum_{\lambda \in Y_i} \frac{\left|\left\{\lambda' \in Y_i : r_i(\lambda') \le r_i(\lambda)\right\}\right|}{r_i(\lambda)}$$

34.7.3 Hierarchical

The hierarchical loss (Cesa-Bianchi et al., 2006b) is a modified version of the Hamming loss that takes into account an existing hierarchical structure of the labels. It examines the predicted labels in a top-down manner according to the hierarchy, and whenever the prediction for a label is wrong, the subtree rooted at that node is not considered further in the calculation of the loss. Let $\text{anc}(\lambda)$ be the set of all the ancestor nodes of $\lambda$. The hierarchical loss is defined as follows:

$$\text{H-Loss} = \frac{1}{m} \sum_{i=1}^{m} \left|\left\{\lambda : \lambda \in Y_i \,\Delta\, Z_i,\ \text{anc}(\lambda) \cap (Y_i \,\Delta\, Z_i) = \emptyset\right\}\right|$$

Several other measures for hierarchical (multi-label) classification are examined in (Moskovitch et al., 2006, Sun & Lim, 2001).

34.8 Related Tasks

One of the most popular supervised learning tasks is multi-class classification, which involves a set of labels $L$, where $|L| > 2$. The critical difference with respect to multi-label classification is that each instance is associated with only one element of $L$, instead of a subset of $L$.
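As a brief aside before the remaining related tasks, the ranking-based measures and the hierarchical loss defined in Section 34.7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the chapter's or any library's implementation: rankings are given as dictionaries mapping each label to its 1-based rank r_i(λ), every Y_i is non-empty and a proper subset of L, the hierarchy is a hypothetical `parent` dictionary (roots have no entry), and the symbol Δ in H-Loss is read as set symmetric difference. All function and variable names are made up for the example.

```python
from typing import Dict, List, Set

Ranking = Dict[str, int]  # label -> rank r_i(label); rank 1 is the top of the list


def coverage(ranks: List[Ranking], relevant: List[Set[str]]) -> float:
    """Cov: how far down the ranked list we must go to cover all relevant labels."""
    return sum(max(r[l] for l in y) - 1 for r, y in zip(ranks, relevant)) / len(ranks)


def ranking_loss(ranks: List[Ranking], relevant: List[Set[str]]) -> float:
    """R-Loss: fraction of (relevant, irrelevant) pairs that are ordered wrongly."""
    total = 0.0
    for r, y in zip(ranks, relevant):
        y_bar = set(r) - y  # complement of Y_i with respect to L
        wrong = sum(1 for a in y for b in y_bar if r[a] > r[b])
        total += wrong / (len(y) * len(y_bar))
    return total / len(ranks)


def average_precision(ranks: List[Ranking], relevant: List[Set[str]]) -> float:
    """AvgPrec: for each relevant label, the share of relevant labels ranked at or above it."""
    total = 0.0
    for r, y in zip(ranks, relevant):
        total += sum(sum(1 for o in y if r[o] <= r[l]) / r[l] for l in y) / len(y)
    return total / len(ranks)


def hierarchical_loss(true_sets: List[Set[str]], pred_sets: List[Set[str]],
                      parent: Dict[str, str]) -> float:
    """H-Loss: count a wrong label only if none of its ancestors is also wrong."""
    def ancestors(label: str) -> Set[str]:
        found = set()
        while label in parent:
            label = parent[label]
            found.add(label)
        return found

    total = 0
    for y, z in zip(true_sets, pred_sets):
        wrong = y ^ z  # symmetric difference: labels predicted incorrectly
        total += sum(1 for l in wrong if not (ancestors(l) & wrong))
    return total / len(true_sets)


if __name__ == "__main__":
    ranks = [{"a": 1, "b": 2, "c": 3, "d": 4}, {"a": 2, "b": 1, "c": 4, "d": 3}]
    relevant = [{"a", "c"}, {"b"}]
    print(coverage(ranks, relevant))           # 1.0
    print(ranking_loss(ranks, relevant))       # 0.125
    print(average_precision(ranks, relevant))  # 11/12, about 0.9167
    # Toy hierarchy: A1 is a child of A; B is another root.
    print(hierarchical_loss([{"A", "A1"}], [{"B"}], {"A1": "A"}))  # 2.0
```

In the last call the plain (unnormalized) symmetric difference has three labels, but A1 is skipped because its ancestor A is already wrong, which is exactly the top-down pruning described for the hierarchical loss.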
Jin and Ghahramani (2002) use the term multiple-label problems for semi-supervised classification problems in which each example is associated with more than one class, but only one of those classes is the true class of the example. This task is not as common in real-world applications as the one we are studying.

Multiple-instance or multi-instance learning is a variation of supervised learning, where labels are assigned to bags of instances (Maron & Lozano-Pérez, 1998). In certain applications, the training data can be considered as both multi-instance and multi-label (Zhou, 2007). In image classification, for example, the different regions of an image can be considered as multiple instances, each of which can be labeled with a different concept, such as sunset and sea. Several methods have recently been proposed for addressing such data (Zhou & Zhang, 2006, Zha et al., 2008).

In multitask learning (Caruana, 1997) we try to solve many similar tasks in parallel, usually using a shared representation. By taking advantage of the common characteristics of these tasks, better generalization can be achieved. A typical example is learning to identify handwritten text for different writers in parallel. Training data from one writer can aid the construction of better predictive models for other writers.

34.9 Multi-Label Data Mining Software

There exist a number of implementations of specific algorithms for mining multi-label data, most of which have been discussed in Section 34.2.2. The BoosTexter system (footnote 6) implements the boosting-based approaches proposed in (Schapire, 2000). There also exist Matlab implementations for MLkNN (footnote 7) and BPMLL (footnote 8).

There is also more general-purpose software that handles multi-label data as part of its functionality. LibSVM (Chang & Lin, 2001) is a library for support vector machines that can learn from multi-label data using the binary relevance transformation.
Clus (footnote 9) is a predictive clustering system that is based on decision tree learning. Its capabilities include (hierarchical) multi-label classification.

Finally, Mulan (footnote 10) is open-source software devoted to multi-label data mining. It includes implementations of a large number of learning algorithms, basic capabilities for dimensionality reduction and hierarchical multi-label classification, and an extensive evaluation framework.

6 http://www.cs.princeton.edu/~schapire/boostexter.html
7 http://lamda.nju.edu.cn/datacode/MLkNN.htm
8 http://lamda.nju.edu.cn/datacode/BPMLL.htm
9 http://www.cs.kuleuven.be/~dtai/clus/
10 http://sourceforge.net/projects/mulan/

References

Barutcuoglu, Z., Schapire, R. E. & Troyanskaya, O. G. (2006). Bioinformatics 22, 830–836.
Blockeel, H., Schietgat, L., Struyf, J., Džeroski, S. & Clare, A. (2006). Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 4213 LNAI, 18–29.
Boleda, G., im Walde, S. S. & Badia, T. (2007). In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning pp. 171–180, Prague.
Boutell, M., Luo, J., Shen, X. & Brown, C. (2004). Pattern Recognition 37, 1757–1771.
Brinker, K., Fürnkranz, J. & Hüllermeier, E. (2006). In Proceedings of the 17th European Conference on Artificial Intelligence (ECAI ’06) pp. 489–493, Riva del Garda, Italy.
Brinker, K. & Hüllermeier, E. (2007). In Proceedings of the 20th International Conference on Artificial Intelligence (IJCAI ’07) pp. 702–707, Hyderabad, India.
Caruana, R. (1997). Machine Learning 28, 41–75.
Cesa-Bianchi, N., Gentile, C. & Zaniboni, L. (2006a). In ICML ’06: Proceedings of the 23rd international conference on Machine learning pp. 177–184.
Cesa-Bianchi, N., Gentile, C. & Zaniboni, L. (2006b). Journal of Machine Learning Research 7, 31–54.
Chang, C.-C. & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Chawla, N. V., Japkowicz, N. & Kotcz, A. (2004). SIGKDD Explorations 6, 1–6.
Chen, W., Yan, J., Zhang, B., Chen, Z. & Yang, Q. (2007). In Proc. 7th IEEE International Conference on Data Mining pp. 451–456, IEEE Computer Society, Los Alamitos, CA, USA.
Clare, A. & King, R. (2001). In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2001) pp. 42–53, Freiburg, Germany.
Crammer, K. & Singer, Y. (2003). Journal of Machine Learning Research 3, 1025–1058.
de Comite, F., Gilleron, R. & Tommasi, M. (2003). In Proceedings of the 3rd International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM 2003) pp. 35–49, Leipzig, Germany.
Diplaris, S., Tsoumakas, G., Mitkas, P. & Vlahavas, I. (2005). In Proceedings of the 10th Panhellenic Conference on Informatics (PCI 2005) pp. 448–456, Volos, Greece.
Elisseeff, A. & Weston, J. (2002). In Advances in Neural Information Processing Systems 14.
Esuli, A., Fagni, T. & Sebastiani, F. (2008). Information Retrieval 11, 287–313.
Fürnkranz, J., Hüllermeier, E., Mencia, E. L. & Brinker, K. (2008). Machine Learning.
Gao, S., Wu, W., Lee, C.-H. & Chua, T.-S. (2004). In Proceedings of the 21st international conference on Machine learning (ICML ’04) p. 42, Banff, Alberta, Canada.
Ghamrawi, N. & McCallum, A. (2005). In Proceedings of the 2005 ACM Conference on Information and Knowledge Management (CIKM ’05) pp. 195–200, Bremen, Germany.
Godbole, S. & Sarawagi, S. (2004). In Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2004) pp. 22–30.
Harris, M. A., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R., Eilbeck, K., Lewis, S., Marshall, B., Mungall, C., Richter, J., Rubin, G. M., Blake, J. A., Bult, C., Dolan, M., Drabkin, H., Eppig, J. T., Hill, D.
P., Ni, L., Ringwald, M., Balakrishnan, R., Cherry, J. M., Christie, K. R., Costanzo, M. C., Dwight, S. S., Engel, S., Fisk, D. G., Hirschman, J. E., Hong, E. L., Nash, R. S., Sethuraman, A., Theesfeld, C. L., Botstein, D., Dolinski, K., Feierbach, B., Berardini, T., Mundodi, S., Rhee, S. Y., Apweiler, R., Barrell, D., Camon, E., Dimmer, E., Lee, V., Chisholm, R., Gaudet, P., Kibbe, W., Kishore, R., Schwarz, E. M., Sternberg, P., Gwinn, M., Hannick, L., Wortman, J., Berriman, M., Wood, V., de La, Tonellato, P., Jaiswal, P., Seigfried, T. & White, R. (2004). Nucleic Acids Res 32.
Hüllermeier, E., Fürnkranz, J., Cheng, W. & Brinker, K. (2008). Artificial Intelligence 172, 1897–1916.
Ji, S., Tang, L., Yu, S. & Ye, J. (2008). In Proceedings of the 14th SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, USA.
Jin, R. & Ghahramani, Z. (2002). In Proceedings of Neural Information Processing Systems 2002 (NIPS 2002), Vancouver, Canada.
Katakis, I., Tsoumakas, G. & Vlahavas, I. (2008). In Proceedings of the ECML/PKDD 2008 Discovery Challenge, Antwerp, Belgium.
Kohavi, R. & John, G. H. (1997). Artificial Intelligence 97, 273–324.
Lewis, D. D., Yang, Y., Rose, T. G. & Li, F. (2004). J. Mach. Learn. Res. 5, 361–397.
Li, T. & Ogihara, M. (2003). In Proceedings of the International Symposium on Music Information Retrieval pp. 239–240, Washington D.C., USA.
Li, T. & Ogihara, M. (2006). IEEE Transactions on Multimedia 8, 564–574.
Loza Mencia, E. & Fürnkranz, J. (2008a). In 2008 IEEE International Joint Conference on Neural Networks (IJCNN-08) pp. 2900–2907, Hong Kong.
Loza Mencia, E. & Fürnkranz, J. (2008b). In 12th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 2008 pp. 50–65, Antwerp, Belgium.
Luo, X. & Zincir-Heywood, A. (2005). In Proceedings of the 15th International Symposium on Methodologies for Intelligent Systems pp. 161–169.
Maron, O. & Lozano-Pérez, T.
(1998). In Advances in Neural Information Processing Systems 10 pp. 570–576, MIT Press.
McCallum, A. (1999). In Proceedings of the AAAI ’99 Workshop on Text Learning.
Mencia, E. L. & Fürnkranz, J. (2008). In 12th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 2008, Antwerp, Belgium.
Moskovitch, R., Cohenkashi, S., Dror, U., Levy, I., Maimon, A. & Shahar, Y. (2006). Artificial Intelligence in Medicine 37, 177–190.
Park, C. H. & Lee, M. (2008). Pattern Recogn. Lett. 29, 878–887.
Pestian, J. P., Brew, C., Matykiewicz, P., Hovermale, D. J., Johnson, N., Cohen, K. B. & Duch, W. (2007). In BioNLP ’07: Proceedings of the Workshop on BioNLP 2007 pp. 97–104, Association for Computational Linguistics, Morristown, NJ, USA.
Qi, G.-J., Hua, X.-S., Rui, Y., Tang, J., Mei, T. & Zhang, H.-J. (2007). In MULTIMEDIA ’07: Proceedings of the 15th international conference on Multimedia pp. 17–26, ACM, New York, NY, USA.
Read, J. (2008). In Proc. 2008 New Zealand Computer Science Research Student Conference (NZCSRS 2008) pp. 143–150.
Rokach L., Genetic algorithm-based feature set partitioning for classification problems, Pattern Recognition, 41(5):1676–1700, 2008.
Rokach L., Mining manufacturing data using genetic algorithm-based feature set decomposition, Int. J. Intelligent Systems Technologies and Applications, 4(1):57–78, 2008.
Rokach L., Maimon O. and Lavi I., Space Decomposition In Data Mining: A Clustering Approach, Proceedings of the 14th International Symposium On Methodologies For Intelligent Systems, Maebashi, Japan, Lecture Notes in Computer Science, Springer-Verlag, 2003, pp. 24–31.
Rousu, J., Saunders, C., Szedmak, S. & Shawe-Taylor, J. (2006). Journal of Machine Learning Research 7, 1601–1626.
Ruepp, A., Zollner, A., Maier, D., Albermann, K., Hani, J., Mokrejs, M., Tetko, I., Güldener, U., Mannhaupt, G., Münsterkötter, M. & Mewes, H. W. (2004). Nucleic Acids Res 32, 5539–5545.
Schapire, R. E. & Singer, Y.
(2000). Machine Learning 39, 135–168.
Snoek, C. G. M., Worring, M., van Gemert, J. C., Geusebroek, J.-M. & Smeulders, A. W. M. (2006). In MULTIMEDIA ’06: Proceedings of the 14th annual ACM international conference on Multimedia pp. 421–430, ACM, New York, NY, USA.
Spyromitros, E., Tsoumakas, G. & Vlahavas, I. (2008). In Proc. 5th Hellenic Conference on Artificial Intelligence (SETN 2008).
Srivastava, A. & Zane-Ulman, B. (2005). In IEEE Aerospace Conference.
Streich, A. P. & Buhmann, J. M. (2008). In 12th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 2008, Antwerp, Belgium.
Sun, A. & Lim, E.-P. (2001). In ICDM ’01: Proceedings of the 2001 IEEE International Conference on Data Mining pp. 521–528, IEEE Computer Society, Washington, DC, USA.
Sun, L., Ji, S. & Ye, J. (2008). In Proceedings of the 14th SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, USA.
Thabtah, F., Cowling, P. & Peng, Y. (2004). In Proceedings of the 4th IEEE International Conference on Data Mining, ICDM ’04 pp. 217–224.
Trohidis, K., Tsoumakas, G., Kalliris, G. & Vlahavas, I. (2008). In Proc. 9th International Conference on Music Information Retrieval (ISMIR 2008), Philadelphia, PA, USA, 2008.
Tsoumakas, G. & Katakis, I. (2007). International Journal of Data Warehousing and Mining 3, 1–13.
Tsoumakas, G., Katakis, I. & Vlahavas, I. (2008). In Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD’08) pp. 30–44.
Tsoumakas, G. & Vlahavas, I. (2007). In Proceedings of the 18th European Conference on Machine Learning (ECML 2007) pp. 406–417, Warsaw, Poland.
Ueda, N. & Saito, K. (2003). Advances in Neural Information Processing Systems 15, 721–728.
Veloso, A., Wagner, M. J., Goncalves, M. & Zaki, M. (2007). In Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2007) vol. LNAI 4702, pp.
605–612, Springer, Warsaw, Poland.
Vembu, S. & Gärtner, T. (2009). In Preference Learning, (Fürnkranz, J. & Hüllermeier, E., eds). Springer.
Vens, C., Struyf, J., Schietgat, L., Džeroski, S. & Blockeel, H. (2008). Machine Learning 73, 185–214.
Wieczorkowska, A., Synak, P. & Ras, Z. (2006). In Proceedings of the 2006 International Conference on Intelligent Information Processing and Web Mining (IIPWM’06) pp. 307–315.
Wolpert, D. (1992). Neural Networks 5, 241–259.
Yang, S., Kim, S.-K. & Ro, Y. M. (2007). Circuits and Systems for Video Technology, IEEE Transactions on 17, 324–335.
Yang, Y. (1999). Journal of Information Retrieval 1, 67–88.
Yang, Y. & Pedersen, J. O. (1997). In Proceedings of ICML-97, 14th International Conference on Machine Learning, (Fisher, D. H., ed.), pp. 412–420, Morgan Kaufmann Publishers, San Francisco, US, Nashville, US.
Yu, K., Yu, S. & Tresp, V. (2005). In SIGIR ’05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval pp. 258–265, ACM Press, Salvador, Brazil.
Zha, Z.-J., Hua, X.-S., Mei, T., Wang, J., Qi, G.-J. & Wang, Z. (2008). In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on pp. 1–8.
Zhang, M.-L. & Zhou, Z.-H. (2006). IEEE Transactions on Knowledge and Data Engineering 18, 1338–1351.
Zhang, M.-L. & Zhou, Z.-H. (2007a). Pattern Recognition 40, 2038–2048.
Zhang, M.-L. & Zhou, Z.-H. (2007b). In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence pp. 669–674, AAAI Press, Vancouver, British Columbia, Canada.
Zhang, Y., Burer, S. & Street, W. N. (2006). Journal of Machine Learning Research 7, 1315–1338.
Zhang, Y. & Zhou, Z.-H. (2008). In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, AAAI 2008 pp. 1503–1505, AAAI Press, Chicago, Illinois, USA.
Zhou, Z.-H. (2007).
In Proceedings of the 3rd International Conference on Advanced Data Mining and Applications (ADMA’07) p. 1. Springer.
Zhou, Z.-H. & Zhang, M.-L. (2006). In NIPS, (Schölkopf, B., Platt, J. C. & Hoffman, T., eds), pp. 1609–1616, MIT Press.
Zhu, S., Ji, X., Xu, W. & Gong, Y. (2005). In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in Information Retrieval pp. 274–281.

35 Privacy in Data Mining

Vicenç Torra
IIIA - CSIC, Campus UAB s/n, 08193 Bellaterra, Catalonia, Spain
vtorra@iiia.csic.es

Summary. In this chapter we describe the main tools for privacy in data mining. We present an overview of the tools for protecting data, and then we focus on protection procedures. Information loss and disclosure risk measures are also described.

35.1 Introduction

Data is nowadays gathered in large amounts by companies and national offices. This data is often analyzed using either statistical or data mining methods. When such methods are applied within the walls of the company that has gathered the data, the danger of disclosure of sensitive information might be limited. In contrast, when the analysis has to be performed by third parties, privacy becomes a much more relevant issue.

To make matters worse, it is not uncommon for an analysis to require data not from a single source but from several. This is the case of banks looking for fraud detection and hospitals analyzing diseases and treatments. In the first case, data from several banks might help in fraud detection. Similarly, data from different hospitals might help in finding the causes of a bad response to a given treatment, or the causes of a given disease.

Privacy-Preserving Data Mining (Aggarwal and Yu, 2008) (PPDM) and Statistical Disclosure Control (Willenborg, 2001, Domingo-Ferrer and Torra, 2001a) (SDC) are two related fields with a similar interest in ensuring data privacy.
Their goal is to avoid the disclosure of sensitive or proprietary information to third parties.

Within these fields, several methods have been proposed for processing and analysing data without compromising privacy, and for releasing data while ensuring some level of data privacy. Measures and indices have been defined for evaluating disclosure risk (that is, to what extent data satisfy the privacy constraints), and data utility or information loss (that is, to what extent the protected data is still useful for applications). In addition, tools have been proposed to visualize and compare different approaches for data protection.

In this chapter we will review some of the existing methods and give an overview of the measures. The structure of the chapter is as follows. In Section 35.2, we present a classification of protection procedures. In Section 35.3, we review different interpretations of risk and give an overview of disclosure risk measures. In Section 35.4, we present major protection procedures; in this section we also review k-anonymity. Then, Section 35.5 focuses on how to measure data utility and information loss. A few information loss measures are reviewed there. The chapter finishes in Section 35.6, which presents different approaches for visualizing the trade-off between risk and utility, or risk and information loss. Some conclusions close the chapter.

O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed., DOI 10.1007/978-0-387-09823-4_35, © Springer Science+Business Media, LLC 2010

35.2 On the Classification of Protection Procedures

The literature on Privacy Preserving Data Mining (PPDM) and on Statistical Disclosure Control (SDC) is vast, and a large number of procedures for ensuring privacy have been proposed. We classify them in two categories according to the prior knowledge the data owner has about the usage of the data.

Data-driven or general purpose protection procedures.
In this case, no specific analysis or usage is foreseen for the data. The data owner does not know what kind of analysis will be performed by the third party. This is the case when data is released for public use, as there is no way to know what kind of study a potential user will perform. This situation is common in National Statistical Offices, where data obtained from censuses and questionnaires can be, e.g., downloaded from the internet (census.gov). A similar case can occur for other public offices that regularly publish data obtained from questionnaires. Another case is when data are transferred to, e.g., researchers so that they can analyse them. Hospitals and other healthcare institutions can also be the target of such protection procedures, as they can be interested in procedures that permit different researchers to apply different data analysis tools (e.g., regression, clustering, association rules).

Within data-driven procedures, subcategories can be distinguished according to the type of data used. The main distinction about data types is between original datafiles (e.g., individuals described in terms of attributes) and aggregates of the data (e.g., contingency tables). In the statistical disclosure control community, the former type corresponds to microdata and the latter to tabular data.

With respect to the type or structure of the original files, most of the research has focused on standard files with numerical or categorical data (ordinal or nominal categorical data). Nevertheless, other more complex types of data have also been considered in the literature, e.g., multirelational databases, logs, and social networks. Another aspect to be considered in relation to the structure of the files concerns the constraints that the protected data needs to satisfy (e.g., when there is a linear combination of some variables). Data protection methods need to consider such constraints so that the protected data also satisfies them (see, e.g.,
(Torra, 2008) for details on a classification of the constraints and a study of microaggregation in this light).

Computation-driven or specific purpose protection procedures. In this case it is known beforehand which type of analysis has to be applied to the data. As the data uses are known, protection procedures are defined according to the intended subsequent computation. Thus, protection procedures are tailored to a specific purpose.

This will be the case of a retailer with a commercial database with information on customers having a fidelity card, when such data has to be transferred to a third party for market basket analysis. For example, there exist tailored procedures for data protection for association rules. They can be applied in this context of market basket analysis.

Results-driven protection procedures. In this case, privacy concerns the result of applying a particular data mining method to some particular data (Atallah et al., 1999, Atzori et al., 2008). For example, the association rules obtained from a commercial database should not permit the disclosure of sensitive information about particular customers. Although this class of procedures can be seen as computation-driven, it is important enough to deserve its own class. This class of methods is also known as anonymity preserving pattern discovery (Atzori et al., 2008), result privacy (Bertino et al., 2008), and output secrecy (Haritsa, 2008).

Other dimensions have been considered in the literature for classifying protection procedures. One of them concerns the number of data sources.

Single data source. The data analysis only requires data from a single source.

Multiple data sources. Data from different sources have to be combined in order to compute a certain analysis.

The analysis of data protection procedures for multiple data sources usually falls within the computation-driven approach.
A typical scenario in this setting is when a few companies collaborate in a certain analysis, each one providing its own database. In the typical scenario within data privacy, data owners want to compute such analysis without disclosing their own data to the other data owners. So, the goal is that at the end of the analysis the only additional information obtained by each of the data owners is the result of the analysis itself. That is, no extra knowledge should be acquired while computing the analysis.

A trivial approach for solving this problem is to consider a trusted third party (TTP) that computes the analysis. This is the centralized approach. In this case, data is just transferred using a completely secure channel (i.e., using cryptographic protocols). In contrast, in distributed privacy preserving data mining, data owners compute the analysis in a collaborative manner. In this way, the trusted third party is not needed. For such computation, cryptographic tools are also used.

Multiple data sources for data-driven protection procedures have limited interest. Each data owner can publish its own data protected using general purpose protection procedures, and then the data can be linked (using, e.g., record linkage algorithms) and finally analysed. So, this roughly corresponds to multidatabase mining.

The literature often classifies protection procedures using another dimension concerning the type of tools used. That is, methods are classified as following either the perturbative or the cryptographic approach. Our classification given above encompasses these two approaches. General purpose protection procedures follow the so-called perturbative approach, while computation-driven protection procedures mainly follow the cryptographic approach. Note, however, that there are some papers on perturbative approaches, such as noise addition, for specific uses such as association rules (see (Atallah et al., 1999)).
Nevertheless, such methods are general enough to be used in other applications. So, they are general purpose protection procedures. In addition, it is important to underline that, in this chapter, we will not use the term perturbative approach with the interpretation above. Instead, we will use the term perturbative methods/approaches in a more restricted way (see Section 35.4), as is usual in the statistical disclosure control community.

In the rest of this section we further discuss both computation-driven and data-driven procedures.
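To make the distributed setting described above more concrete (several data owners computing a joint result without a trusted third party, each learning only the result), the following is a minimal sketch of a secure sum based on additive secret sharing. It is a standard textbook building block chosen here for illustration, not a method taken from this chapter. The names `MODULUS`, `make_shares`, and `secure_sum` are hypothetical, all parties are simulated in one process, and the sketch assumes honest-but-curious participants, pairwise secure channels, and non-negative integer inputs whose total is smaller than the public modulus.

```python
import secrets
from typing import List

# Assumption: a public modulus larger than any total the parties could produce.
MODULUS = 2**61 - 1


def make_shares(value: int, n_parties: int) -> List[int]:
    """Split `value` into n random-looking shares that add up to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values: List[int]) -> int:
    """Simulate the protocol: no single message reveals any owner's private value."""
    n = len(private_values)
    # Step 1: owner i splits its value and sends share j to owner j.
    distributed = [make_shares(v, n) for v in private_values]
    # Step 2: owner j adds the shares it received, one from every owner.
    partial_sums = [sum(distributed[i][j] for i in range(n)) % MODULUS
                    for j in range(n)]
    # Step 3: the partial sums are published and added to obtain the total.
    return sum(partial_sums) % MODULUS


if __name__ == "__main__":
    # Three hypothetical companies, each holding a private count.
    print(secure_sum([120, 45, 310]))  # 475
```

Each share on its own is uniformly distributed, so what an owner receives in step 1 tells it nothing about another owner's input; only the final total is disclosed, which matches the goal that the only additional information obtained is the result of the analysis itself.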

Date posted: 04/07/2014, 05:21
