
Data Mining and Knowledge Discovery Handbook, 2nd Edition, part 74





Vicenç Torra

one), the results are similar. Some parameterizations of rank swapping (Rank with parameter p in the Table) and microaggregation (Micmul with parameter k in the Table) are ranked among the best algorithms both in (Domingo-Ferrer and Torra, 2001b) and here. The comparison can be extended by evaluating new masking methods and comparing them with the existing scores. For example, the results from (Jimenez and Torra, 2009) would permit including in this table (with a score lower than 40) some parameterizations of lossy compression using JPEG 2000.

35.6.2 R-U Maps

(Duncan et al., 2001, Duncan et al., 2004) propose R-U maps, short for Risk-Utility maps. An R-U map is a graphical representation of the two measures: R for risk and U for utility. Figure 35.2 represents an R-U map for the methods listed in the previous section, each with several parameterizations. Namely, RankXXX corresponds to Rank Swapping, MicXXX are variations of Microaggregation, JPEGXXX corresponds to Lossy Compression using JPEG, and RemuestX is resampling (not described in this chapter). In the figure, DR corresponds to the Disclosure Risk (R following the standard jargon of R-U maps), and IL to information loss (in our case computed as aPIL). Formally, IL and utility U are related as follows: 1 − U = IL. Note that in addition to the protection procedures represented in Table 35.1, the figure includes all the other methods analyzed in (Domingo-Ferrer and Torra, 2001b), but with the new measures DR and aPIL described above. In this figure, the lines represent scores of 50, 40, 30, and 20. Naturally, the nearer a method is to (0,0), the better.

35.7 Conclusions

In this chapter we have reviewed the major topics concerning privacy in data mining. We have reviewed major protection methods, and discussed how to measure disclosure risk and information loss. Finally, some tools for visualizing such measures and for comparing the methods have been described.
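The R-U map of Section 35.6.2 can be illustrated with a small sketch. It assumes that a method's score is the unweighted mean of IL and DR; this excerpt does not restate the score definition, so that formula is an assumption, chosen because it yields straight iso-score lines and makes (0,0) the best point. The names `MaskingResult`, `rank_methods`, and `iso_score_il` are hypothetical, and IL and DR are taken as percentages, as on the axes of Figure 35.2.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class MaskingResult:
    """One point on an R-U map (hypothetical container, not from the chapter)."""
    name: str
    dr: float  # disclosure risk, in percent (0..100)
    il: float  # information loss, in percent (0..100)

    @property
    def utility(self) -> float:
        # 1 - U = IL, with IL rescaled from percent to the unit interval.
        return 1.0 - self.il / 100.0

    @property
    def score(self) -> float:
        # ASSUMPTION: the score is the unweighted mean of IL and DR.
        return 0.5 * self.il + 0.5 * self.dr


def rank_methods(results: Iterable[MaskingResult]) -> List[MaskingResult]:
    """Sort methods from lowest (best) score to highest (worst) score."""
    return sorted(results, key=lambda r: r.score)


def iso_score_il(score: float, dr: float) -> float:
    """IL value at which a method with the given DR has exactly this score."""
    return 2.0 * score - dr
```

Under this assumption, each line drawn in the figure for a score s is the set of points with IL + DR = 2s, so `iso_score_il(40.0, 30.0)` gives the IL at which a method with DR = 30 sits exactly on the score-40 line.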
Acknowledgements

Part of the research described in this chapter is supported by the Spanish MEC (projects ARES – CONSOLIDER INGENIO 2010 CSD2007-00004 – and eAEGIS – TSI2007-65406-C03-02).

References

Adam, N. R., Wortmann, J. C. (1989) Security-control for statistical databases: a comparative study, ACM Computing Surveys 21 515-556.
Aggarwal, C. (2005) On k-anonymity and the curse of dimensionality, Proceedings of the 31st International Conference on Very Large Databases, 901-909.
Aggarwal, C. C., Yu, P. S. (2008) Privacy-Preserving Data Mining: Models and Algorithms, Springer.

35 Privacy in Data Mining

[Fig. 35.2. R-U Maps for some protection methods. IL computed with PIL. Plot title: Risk/Utility Map; axes DR and IL, each ranging from 0 to 100; plotted method families: Distr, Remuest, JPEG, MicOI, Adit, Mic2mul, Mic3mul, Mic4mul, Micmul, MicZ, MicPCP, Rank.]

Agrawal, R., Srikant, R. (2000) Privacy Preserving Data Mining, Proc.
of the ACM SIGMOD Conference on Management of Data, 439-450.
Atallah, M., Bertino, E., Elmagarmid, A., Ibrahim, M., Verykios, V. (1999) Disclosure limitation of sensitive rules, Proc. of IEEE Knowledge and Data Engineering Exchange Workshop (KDEX).
Atzori, M., Bonchi, F., Giannotti, F., Pedreschi, D. (2008) Anonymity preserving pattern discovery, The VLDB Journal 17 703-727.
Bacher, J., Brand, R., Bender, S. (2002) Re-identifying register data by survey data using cluster analysis: an empirical study, Int. J. of Unc., Fuzz. and Knowledge Based Systems 10:5 589-607.
Bertino, E., Lin, D., Jiang, W. (2008) A survey of quantification of privacy preserving data mining algorithms, in C. C. Aggarwal, P. S. Yu (eds.) Privacy-Preserving Data Mining: Models and Algorithms, Springer, 183-205.
Brand, R. (2002) Microdata protection through noise addition, in J. Domingo-Ferrer (ed.) Inference Control in Statistical Databases, Lecture Notes in Computer Science 2316 97-116.
Bunn, P., Ostrovsky, R. (2007) Secure two-party k-means clustering, Proc. of CCS’07, ACM Press, 486-497.
Burridge, J. (2003) Information preserving statistical obfuscation, Statistics and Computing 13 321-327.
Carlson, M., Salabasis, M. (2002) A data swapping technique using ranks: a method for disclosure control, Research on Official Statistics 5:2 35-64.
Dalenius, T. (1977) Towards a methodology for statistical disclosure control, Statistisk Tidskrift 5 429-444.
Dalenius, T. (1986) Finding a needle in a haystack - or identifying anonymous census records, Journal of Official Statistics 2:3 329-336.
Defays, D., Nanopoulos, P. (1993) Panels of enterprises and confidentiality: the small aggregates method, Proc. of 92 Symposium on Design and Analysis of Longitudinal Surveys, Statistics Canada, 195-204.
Dempster, A. P., Laird, N. M., Rubin, D. B. (1977) Maximum Likelihood From Incomplete Data Via the EM Algorithm, Journal of the Royal Statistical Society 39 1-38.
Domingo-Ferrer, J., Mateo-Sanz, J. M. (2002) Practical data-oriented microaggregation for statistical disclosure control, IEEE Trans. on Knowledge and Data Engineering 14:1 189-201.
Domingo-Ferrer, J., Mateo-Sanz, J. M., Torra, V. (2001) Comparing SDC methods for microdata on the basis of information loss and disclosure risk, Pre-proceedings of ETK-NTTS’2001, (Eurostat, ISBN 92-894-1176-5), Vol. 2, 807-826, Creta, Greece.
Domingo-Ferrer, J., Sebe, F., Castella-Roca, J. (2004) On the security of noise addition for privacy in statistical databases, PSD 2004, Lecture Notes in Computer Science 3050 149-161.
Domingo-Ferrer, J., Torra, V. (2001) Disclosure Control Methods and Information Loss for Microdata, in P. Doyle, J. I. Lane, J. J. M. Theeuwes, L. Zayatz (eds.) Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, Elsevier Science, 91-110.
Domingo-Ferrer, J., Torra, V. (2001) A quantitative comparison of disclosure control methods for microdata, in P. Doyle, J. I. Lane, J. J. M. Theeuwes, L. Zayatz (eds.) Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, North-Holland, 111-134.
Domingo-Ferrer, J., Torra, V. (2003) Disclosure Risk Assessment in Statistical Microdata Protection via advanced record linkage, Statistics and Computing 13 343-354.
Domingo-Ferrer, J., Torra, V. (2005) Ordinal, Continuous and Heterogeneous k-Anonymity Through Microaggregation, Data Mining and Knowledge Discovery 11:2 195-212.
Duncan, G. T., Keller-McNulty, S. A., Stokes, S. L. (2001) Disclosure risk vs. data utility: The R-U confidentiality map, Technical Report 121, National Institute of Statistical Sciences.
Duncan, G. T., Keller-McNulty, S. A., Stokes, S. L. (2001) Database security and confidentiality: examining disclosure risk vs. data utility through the R-U confidentiality map, Technical Report 142, National Institute of Statistical Sciences.
Duncan, G.
T., Lambert, D. (1986) Disclosure-limited data dissemination, Journal of the American Statistical Association 81 10-18.
Duncan, G. T., Lambert, D. (1989) The risk disclosure for microdata, Journal of Business and Economic Statistics 7 207-217.
Elamir, E. A. H. (2004) Analysis of re-identification risk based on log-linear models, PSD 2004, Lecture Notes in Computer Science 3050 273-281.
Elliot, M. (2002) Integrating file and record level disclosure risk assessment, in J. Domingo-Ferrer (ed.) Inference Control in Statistical Databases, Lecture Notes in Computer Science 2316 126-134.
Elliot, M. J., Skinner, C. J., Dale, A. (1998) Special Uniqueness, Random Uniques and Sticky Populations: Some Counterintuitive Effects of Geographical Detail on Disclosure Risk, Research in Official Statistics 1:2 53-67.
Fellegi, I. P., Sunter, A. B. (1969) A theory for record linkage, Journal of the American Statistical Association 64:328 1183-1210.
Felsö, F., Theeuwes, J., Wagner, G. (2001) Disclosure Limitation in Use: Results of a Survey, in P. Doyle, J. I. Lane, J. J. M. Theeuwes, L. Zayatz (eds.) Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, Elsevier Science, 17-42.
Franconi, L., Polettini, S. (2004) Individual risk estimation in μ-Argus: a review, PSD 2004, Lecture Notes in Computer Science 3050 262-272.
Gouweleeuw, J. M., Kooiman, P., Willenborg, L. C. R. J., De Wolf, P.-P. (1998) Post Randomisation for Statistical Disclosure Control: Theory and Implementation, Journal of Official Statistics 14:4 463-478. Also as Research Paper No. 9731, Voorburg: Statistics Netherlands (1997).
Gross, B., Guiblin, P., Merrett, K. (2004) Implementing the Post Randomisation method to the individual sample of anonymised records (SAR) from the 2001 Census, paper presented at “The Samples of Anonymised Records, An Open Meeting on the Samples of Anonymised Records from the 2001 Census”.
http://www.ccsr.ac.uk/sars/events/2004-09-30/gross.pdf
Hansen, S., Mukherjee, S. (2003) A Polynomial Algorithm for Optimal Univariate Microaggregation, IEEE Trans. on Knowledge and Data Engineering 15:4 1043-1044.
Haritsa, J. R. (2008) Mining association rules under privacy constraints, in C. C. Aggarwal, P. S. Yu (eds.) Privacy-Preserving Data Mining: Models and Algorithms, Springer, 239-266.
Hundepool, A., van de Wetering, A., Ramaswamy, R., Franconi, L., Capobianchi, C., de Wolf, P.-P., Domingo-Ferrer, J., Torra, V., Brand, R., Giessing, S. (2003) μ-ARGUS version 3.2 Software and User’s Manual, Voorburg NL, Statistics Netherlands, February 2003; version 4.0 published in May 2005. http://neon.vb.cbs.nl/casc.
Jaro, M. A. (1989) Advances in record-linkage methodology as applied to matching the 1985 Census of Tampa, Florida, Journal of the American Statistical Association 84:406 414-420.
Jiménez, J., Torra, V. (2009) Utility and risk of JPEG-based continuous microdata protection methods, Proc. Int. Conf. on Availability, Reliability and Security (ARES 2009), 929-934.
Kantarcioglu, M. (2008) A survey of privacy-preserving methods across horizontally partitioned data, in C. C. Aggarwal, P. S. Yu (eds.) Privacy-Preserving Data Mining: Models and Algorithms, Springer, 313-335.
Kim, J., Winkler, W. (2003) Multiplicative noise for masking continuous data, Research Report Series (Statistics 2003-01), U. S. Bureau of the Census.
Kisilevich, S., Rokach, L., Elovici, Y., Shapira, B. (2010) Efficient Multidimensional Suppression for K-Anonymity, IEEE Transactions on Knowledge and Data Engineering 22:3 334-347.
Ladra, S., Torra, V. (2008) On the comparison of generic information loss measures and cluster-specific ones, Intl. J. of Unc., Fuzz. and Knowledge-Based Systems 16:1 107-120.
Lambert, D. (1993) Measures of Disclosure Risk and Harm, Journal of Official Statistics 9 313-331.
LeFevre, K., DeWitt, D. J., Ramakrishnan, R.
(2005) Multidimensional k-anonymity, Technical Report 1521, University of Wisconsin.
LeFevre, K., DeWitt, D. J., Ramakrishnan, R. (2005) Incognito: Efficient Full-Domain K-Anonymity, SIGMOD 2005.
Li, N., Li, T., Venkatasubramanian, S. (2007) T-closeness: privacy beyond k-anonymity and l-diversity, Proc. of the IEEE ICDE 2007.
Liew, C. K., Choi, U. J., Liew, C. J. (1985) A data distortion by probability distribution, ACM Transactions on Database Systems 10 395-411.
Lindell, Y., Pinkas, B. (2002) Privacy Preserving Data Mining, Journal of Cryptology 15:3.
Lindell, Y., Pinkas, B. (2000) Privacy Preserving Data Mining, Crypto’00, Lecture Notes in Computer Science 1880 20-24.
Liu, K., Kargupta, H., Ryan, J. (2006) Random projection based multiplicative data perturbation for privacy preserving data mining, IEEE Trans. on Knowledge and Data Engineering 18:1 92-106.
Machanavajjhala, A., Gehrke, J., Kiefer, D., Venkitasubramanian, M. (2006) L-diversity: privacy beyond k-anonymity, Proc. of the IEEE ICDE.
Mateo-Sanz, J. M., Domingo-Ferrer, J., Sebé, F. (2005) Probabilistic information loss measures in confidentiality protection of continuous microdata, Data Mining and Knowledge Discovery 11:2 181-193.
Moore, R. (1996) Controlled data swapping techniques for masking public use microdata sets, U. S. Bureau of the Census (unpublished manuscript).
Muralidhar, K., Sarathy, R. (2008) Generating Sufficiency-based Non-Synthetic Perturbed Data, Transactions on Data Privacy 1:1 17-33.
Nin, J., Herranz, J., Torra, V. (2007) Rethinking Rank Swapping to Decrease Disclosure Risk, Data and Knowledge Engineering 64:1 346-364.
Nin, J., Herranz, J., Torra, V. (2008) How to Group Attributes in Multivariate Microaggregation, Intl. J. of Unc., Fuzz. and Knowledge-Based Systems 16:1 121-138.
Nin, J., Herranz, J., Torra, V. (2008) On the Disclosure Risk of Multivariate Microaggregation, Data and Knowledge Engineering 67:3 399-412.
Nin, J., Herranz, J., Torra, V.
(2008) Towards a More Realistic Disclosure Risk Assessment, Lecture Notes in Computer Science 5262 152-165.
Nin, J., Torra, V. (2006) Extending microaggregation procedures for time series protection, Lecture Notes in Artificial Intelligence 4259 899-908.
Nin, J., Torra, V. (2009) Analysis of the Univariate Microaggregation Disclosure Risk, New Generation Computing 27 177-194.
Oganian, A., Domingo-Ferrer, J. (2000) On the Complexity of Optimal Microaggregation for Statistical Disclosure Control, Statistical J. United Nations Economic Commission for Europe 18:4 345-354.
Paass, G. (1985) Disclosure risk and disclosure avoidance for microdata, Journal of Business and Economic Statistics 6 487-500.
Paass, G., Wauschkuhn, U. (1985) Datenzugang, Datenschutz und Anonymisierung - Analysepotential und Identifizierbarkeit von Anonymisierten Individualdaten, Oldenbourg Verlag.
Pagliuca, D., Seri, G. (1999) Some results of individual ranking method on the system of enterprise accounts annual survey, Esprit SDC Project, Deliverable MI-3/D2.
Pinkas, B. (2002) Cryptographic techniques for privacy-preserving data mining, ACM SIGKDD Explorations 4:2.
Ravikumar, P., Cohen, W. W. (2004) A hierarchical graphical model for record linkage, Proc. of UAI 2004.
Rokach, L. (2008) Genetic algorithm-based feature set partitioning for classification problems, Pattern Recognition 41:5 1676-1700.
Rokach, L., Maimon, O., Lavi, I. (2003) Space Decomposition In Data Mining: A Clustering Approach, Proceedings of the 14th International Symposium On Methodologies For Intelligent Systems, Maebashi, Japan, Lecture Notes in Computer Science, Springer-Verlag, 24-31.
Samarati, P. (2001) Protecting Respondents’ Identities in Microdata Release, IEEE Trans. on Knowledge and Data Engineering 13:6 1010-1027.
Samarati, P., Sweeney, L.
(1998) Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression, SRI Intl. Tech. Rep.
Spruill, N. L. (1983) The confidentiality and analytic usefulness of masked business microdata, Proc. of the Section on Survey Research Methods 1983, American Statistical Association, 602-610.
Sweeney, L. (2002) Achieving k-anonymity privacy protection using generalization and suppression, Int. J. of Unc., Fuzz. and Knowledge Based Systems 10:5 571-588.
Sweeney, L. (2002) k-anonymity: a model for protecting privacy, Int. J. of Unc., Fuzz. and Knowledge Based Systems 10:5 557-570.
Takemura, A. (2002) Local recoding and record swapping by maximum weight matching for disclosure control of microdata sets, Journal of Official Statistics 18 275-289. Preprint (1999) Local recoding by maximum weight matching for disclosure control of microdata sets.
Templ, M. (2008) Statistical Disclosure Control for Microdata Using the R-Package sdcMicro, Transactions on Data Privacy 1 67-85.
Torra, V. (2004) Microaggregation for categorical variables: a median based approach, Proc. Privacy in Statistical Databases (PSD 2004), Lecture Notes in Computer Science 3050 162-174.
Torra, V. (2004) OWA operators in data modeling and reidentification, IEEE Trans. on Fuzzy Systems 12:5 652-660.
Torra, V. (2008) Constrained Microaggregation: Adding Constraints for Data Editing, Transactions on Data Privacy 1:2 86-104.
Torra, V., Abowd, J. M., Domingo-Ferrer, J. (2006) Using Mahalanobis Distance-Based Record Linkage for Disclosure Risk Assessment, Lecture Notes in Computer Science 4302 233-242.
Torra, V., Domingo-Ferrer, J. (2003) Record linkage methods for multidatabase data mining, in V. Torra (ed.) Information Fusion in Data Mining, Springer, 101-132.
Torra, V., Miyamoto, S. (2004) Evaluating fuzzy clustering algorithms for microdata protection, PSD 2004, Lecture Notes in Computer Science 3050 175-186.
Trottini, M.
(2003) Decision models for data disclosure limitation, PhD Dissertation, Carnegie Mellon University. http://www.niss.org/dgii/TR/Thesis-Trottini-final.pdf
Truta, T. M., Vinay, B. (2006) Privacy protection: p-sensitive k-anonymity property, Proc. 2nd Int. Workshop on Privacy Data management (PDM 2006), p. 94.
Willenborg, L., de Waal, T. (2001) Elements of Statistical Disclosure Control, Lecture Notes in Statistics, Springer-Verlag.
Winkler, W. E. (1993) Matching and record linkage, Statistical Research Division, U. S. Bureau of the Census (USA), RR93/08.
Winkler, W. E. (2004) Re-identification methods for masked microdata, PSD 2004, Lecture Notes in Computer Science 3050 216-230.
Yancey, W. E., Winkler, W. E., Creecy, R. H. (2002) Disclosure risk assessment in perturbative microdata protection, in J. Domingo-Ferrer (ed.) Inference Control in Statistical Databases, Lecture Notes in Computer Science 2316 135-152.
Yao, A. C. (1982) Protocols for Secure Computations, Proc. of 23rd IEEE Symposium on Foundations of Computer Science, Chicago, Illinois, 160-164.
http://www.census.gov

36 Meta-Learning - Concepts and Techniques

Ricardo Vilalta (1), Christophe Giraud-Carrier (2), and Pavel Brazdil (3)
(1) University of Houston
(2) Brigham Young University
(3) University of Porto

Summary. The field of meta-learning has as one of its primary goals the understanding of the interaction between the mechanism of learning and the concrete contexts in which that mechanism is applicable. The field has seen a continuous growth in the past years with interesting new developments in the construction of practical model-selection assistants, task-adaptive learners, and a solid conceptual framework. In this chapter we give an overview of different techniques necessary to build meta-learning systems. We begin by describing an idealized meta-learning architecture comprising a variety of relevant component techniques.
We then look at how each technique has been studied and implemented by previous research. In addition we show how meta-learning has already been identified as an important component in real-world applications.

Key words: Meta-learning

O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed., DOI 10.1007/978-0-387-09823-4_36, © Springer Science+Business Media, LLC 2010

36.1 Introduction

We are used to thinking of a learning system as a rational agent capable of adapting to a specific environment by exploiting knowledge gained through experience; encountering multiple and diverse scenarios sharpens the ability of the learning system to predict the effect produced from selecting a particular course of action. In this case, learning is made manifest because the quality of the predictions normally improves with an increasing number of scenarios or examples. Nevertheless, if the predictive mechanism were to start afresh on different tasks, the learning system would find itself at a considerable disadvantage; learning systems capable of modifying their own predictive mechanism would soon outperform our base learner by being able to change their learning strategy according to the characteristics of the task under analysis.

Meta-learning differs from base-learning in the scope of the level of adaptation; whereas learning at the base-level is based on accumulating experience on a specific learning task (e.g., credit rating, medical diagnosis, mine-rock discrimination, fraud detection, etc.), learning at the meta-level is based on accumulating experience on the performance of multiple applications of a learning system. If a base-learner fails to perform efficiently, one would expect the learning mechanism itself to adapt in case the same task is presented again.
Meta-learning is then important in understanding the interaction between the mechanism of learning and the concrete contexts in which that mechanism is applicable. Briefly stated, the field of meta-learning is focused on the relation between tasks or domains and learning strategies. In that sense, by learning or explaining what causes a learning system to be successful or not on a particular task or domain, we go beyond the goal of producing more accurate learners to the additional goal of understanding the conditions (e.g., types of example distributions) under which a learning strategy is most appropriate.

From a practical stance, meta-learning can solve important problems in the application of machine learning (ML) and Data Mining (DM) tools, particularly in the area of classification and regression. First, the successful use of these tools outside the boundaries of research (e.g., industry, commerce, government) is conditioned on the appropriate selection of a suitable predictive model (or combination of models) according to the domain of application. Without any kind of assistance, model selection and combination can turn into stumbling blocks to the end-user who wishes to access the technology more directly and cost-effectively. End-users often lack not only the expertise necessary to select a suitable model, but also the availability of many models to proceed on a trial-and-error basis (e.g., by measuring accuracy via some re-sampling technique such as n-fold cross-validation). A solution to this problem is attainable through the construction of meta-learning systems. These systems can provide automatic and systematic user guidance by mapping a particular task to a suitable model (or combination of models).

Second, a problem commonly observed in the practical use of ML and DM tools is how to profit from the repetitive use of a predictive model over similar tasks.
The successful application of models in real-world scenarios requires a continuous adaptation to new needs. Rather than starting afresh on new tasks, we expect the learning mechanism itself to re-learn, taking into account previous experience (Thrun, 1998, Pratt et al., 1991, Caruana, 1997, Vilalta and Drissi, 2002). Again, meta-learning systems can help control the process of exploiting cumulative expertise by searching for patterns across tasks.

Our goal in this chapter is to give an overview of different techniques necessary to build meta-learning systems. To impose some structure, we begin by describing an idealized meta-learning architecture comprising a variety of relevant component techniques. We then look at how each technique has been studied and implemented by previous research. We hope that by proceeding in this way the reader can not only learn from past work, but in addition gain some insight into how to construct meta-learning systems. We also hope to show how recent advances in meta-learning are increasingly filling the gaps in the construction of practical model-selection assistants and task-adaptive learners, as well as in the development of a solid conceptual framework (Baxter, 1998, Baxter, 2000, Giraud-Carrier et al., 2004).

This chapter is organized as follows. In the next section we illustrate an idealized meta-learning architecture and detail its constituent parts. In Section 36.3 we describe previous research in meta-learning and its relation to our architecture. Section 36.4 describes a meta-learning tool that has been instrumental as a decision support tool in real applications. Lastly, Section 36.5 discusses future directions and provides our conclusions.

36.2 A Meta-Learning Architecture

In this section we provide a general view of a software architecture that will be used as a reference to describe many of the principles and current techniques in meta-learning.
Though not every technique in meta-learning fits into this architecture, such a general view helps us understand the challenges we need to overcome before we can turn the technology into a set of useful and practical tools.

36.2.1 Knowledge-Acquisition Mode

To begin, we propose a meta-learning system that divides into two modes of operation. During the first mode, also known as the knowledge-acquisition mode, the main goal is to learn about the learning process itself. Figure 36.1 illustrates this mode of operation. We assume the input to the system is made of more than one dataset of examples (e.g., more than one set of pairs of feature vectors and classes; Figure 36.1A). Upon arrival of each dataset, the meta-learning system invokes a component responsible for extracting dataset characteristics or meta-features (Figure 36.1B). The goal of this component is to gather information that transcends the particular domain of application. We look for information that can be used to generalize to other example distributions. Section 36.3.1 details current research pointing in this direction.

During the knowledge-acquisition mode, the learning technique (Figure 36.1C) does not exploit knowledge across different datasets or tasks. Each dataset is considered independently of the rest; the output of the system is a learning strategy (e.g., a classifier or combination of classifiers, Figure 36.1D). Statistics derived from the output model or its performance (Figure 36.1E) may also serve as a form of characterizing the task under analysis (Sections 36.3.1 and 36.3.1).

Information derived from the meta-feature generator and the performance evaluation module can be combined into a meta-knowledge base (Figure 36.1F). This knowledge base is the main result of the knowledge-acquisition phase; it reflects experience accumulated across different tasks. Meta-learning is tightly linked to the process of acquiring and exploiting meta-knowledge.
One can even say that advances in the field of meta-learning hinge on one specific question: how can we acquire and exploit knowledge about learning systems (i.e., meta-knowledge) to understand and improve their performance? As we describe current research in meta-learning we will point out different forms of meta-knowledge.

36.2.2 Advisory Mode

The efficiency of the meta-learner increases as it accumulates meta-knowledge. We assume the lack of experience at the beginning of the learner’s life compels the meta-learner to use one or more learning strategies without a clear preference for one of them; experimenting with many different strategies becomes time consuming. However, as more training sets are examined, we expect the expertise of the meta-learner to dominate in deciding which learning strategy best suits the characteristics of the training set.

In the advisory mode, meta-knowledge acquired in the exploratory (knowledge-acquisition) mode is used to configure the learning system in a manner that exploits the characteristics of the new data distribution. Meta-features extracted from the dataset (Figure 36.2B) are matched with the meta-knowledge base (Figure 36.2F) to produce a recommendation regarding the best available learning strategy. At this point we move away from the use of static base learners to the ability to do model selection or to combine base learners (Figure 36.2C).

Two observations are worth considering at this point. First, the nature of the match between the set of meta-features and the meta-knowledge base can have several interpretations. The traditional view poses this problem as a learning problem itself, where a meta-learner is invoked to output an approximating function mapping meta-features to learning strategies.
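The two modes described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the chapter's architecture: the three meta-features (log of the number of examples, number of features, class entropy) are chosen only for illustration, the candidate strategies are supplied by the caller as hypothetical evaluation functions that return an accuracy estimate, and the meta-learner is a plain 1-nearest-neighbour match in meta-feature space. The names `meta_features`, `acquire`, and `advise` are invented for this example.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

# A dataset is a pair (feature vectors, class labels).
Dataset = Tuple[List[Sequence[float]], List[str]]
MetaFeatures = Tuple[float, float, float]
# The meta-knowledge base pairs meta-features with the best strategy found.
MetaKnowledge = List[Tuple[MetaFeatures, str]]


def meta_features(dataset: Dataset) -> MetaFeatures:
    """Extract simple dataset characteristics (illustrative choice only)."""
    X, y = dataset
    n = len(y)
    counts = Counter(y)
    # Shannon entropy of the class distribution, in bits.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return (math.log10(n), float(len(X[0])), entropy)


def acquire(datasets: List[Dataset],
            strategies: Dict[str, Callable[[Dataset], float]]) -> MetaKnowledge:
    """Knowledge-acquisition mode: evaluate every strategy on every dataset
    independently and record which one performed best."""
    base: MetaKnowledge = []
    for ds in datasets:
        scores = {name: evaluate(ds) for name, evaluate in strategies.items()}
        best = max(scores, key=scores.get)
        base.append((meta_features(ds), best))
    return base


def advise(dataset: Dataset, base: MetaKnowledge) -> str:
    """Advisory mode: recommend the strategy stored for the most similar
    previously seen dataset (1-nearest neighbour on meta-features)."""
    query = meta_features(dataset)
    _, strategy = min(base, key=lambda entry: math.dist(entry[0], query))
    return strategy
```

In practice the evaluation functions would wrap a re-sampling estimate such as n-fold cross-validation, and the meta-features would usually be normalized before distances are computed; both are omitted here to keep the sketch short.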

Date posted: 04/07/2014, 05:21