
Data Mining and Knowledge Discovery Handbook, 2nd Edition (part 105)


The main goal of the Discovery Net project is to design, develop and implement an infrastructure to effectively support scientific knowledge discovery processes from high-throughput informatics. In this context, a series of testbeds and demonstrations are being carried out to apply the technology in the areas of life sciences, environmental modeling and geo-hazard prediction.

The building blocks in Discovery Net are the so-called Knowledge Discovery Services (KDS), distinguished into Computation Services and Data Services. The former typically comprise algorithms, e.g. for data preparation and Data Mining, while the latter define relational tables (as queries) and other data sources. Both kinds of services are described (and registered) by means of Adapters, providing information such as input and output types, parameters, location and/or platform/operating system constraints, factories (objects that allow references to services to be retrieved and the services to be downloaded), keywords and a human-readable description. KDS are used to compose moderately complex data-pipelined processes. The composition may be carried out by means of a GUI which provides access to a library of services. The XML-based language used to describe processes is called Discovery Process Markup Language (DPML). Each composed process can be deployed and published as a new process. Typically, process descriptions are not bound to specific servers, since the actual resources are later resolved by lookup servers (see below).

Discovery Net is based on an open architecture using common protocols and infrastructures such as the Globus Toolkit. Servers are distinguished into (i) Knowledge Servers, allowing storage and retrieval of knowledge (meant as raw data and knowledge models) and processes; (ii) Resource Discovery Servers, providing a knowledge base of service definitions and performing resource resolution; (iii) Discovery Meta-Information Servers, used to store information about the Knowledge Schema, i.e. the sets of features of known databases, their types, and how they can be composed with each other.
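To make the role of Adapters more concrete, the following Python sketch models a toy KDS registry. It only illustrates the kind of metadata an Adapter carries (input/output types, parameters, location, constraints, keywords) and how a lookup server might resolve services; the class names, fields and lookup logic are assumptions of this sketch, not the actual Discovery Net or DPML interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Adapter:
    """Hypothetical descriptor for a Knowledge Discovery Service (KDS)."""
    name: str
    kind: str                                   # "computation" or "data"
    input_types: List[str]
    output_types: List[str]
    parameters: Dict[str, str] = field(default_factory=dict)
    location: Optional[str] = None              # host or URL where the service runs
    platform_constraints: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    description: str = ""

class ServiceRegistry:
    """Toy lookup server: registers Adapters and resolves them by keyword or input type."""
    def __init__(self) -> None:
        self._adapters: Dict[str, Adapter] = {}

    def register(self, adapter: Adapter) -> None:
        self._adapters[adapter.name] = adapter

    def find(self, keyword: Optional[str] = None,
             input_type: Optional[str] = None) -> List[Adapter]:
        hits = []
        for a in self._adapters.values():
            if keyword and keyword not in a.keywords:
                continue
            if input_type and input_type not in a.input_types:
                continue
            hits.append(a)
        return hits

# Register one data service and one computation service, then resolve by input type.
registry = ServiceRegistry()
registry.register(Adapter(
    name="gene-expression-table", kind="data",
    input_types=["sql-query"], output_types=["relational-table"],
    location="db.example.org", keywords=["genomics", "expression"],
    description="Relational view over a gene expression repository"))
registry.register(Adapter(
    name="kmeans-clustering", kind="computation",
    input_types=["relational-table"], output_types=["cluster-model"],
    parameters={"k": "number of clusters"},
    keywords=["clustering"], description="K-means over a numeric table"))

for a in registry.find(input_type="relational-table"):
    print(a.name, "->", a.output_types)
```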
Finally, we outline some interesting Data Mining testbeds developed at the National Center for Data Mining (NCDM) at the University of Illinois at Chicago (UIC) (www.ncdm.uic.edu/testbeds.htm):

• The Terra Wide Data Mining Testbed (TWDM). TWDM is an infrastructure for the remote analysis, distributed mining, and real-time exploration of scientific, engineering, business, and other complex data. It consists of five geographically distributed nodes linked by optical networks through StarLight (an advanced optical infrastructure) in Chicago. These sites include StarLight itself, the Laboratory for Advanced Computing at UIC, SARA in Amsterdam, and Dalhousie University in Halifax. In 2003 new sites will be connected, including Imperial College in London. A central idea in TWDM is to keep generated predictive models up-to-date with respect to newly available data, in order to achieve better predictions (an important aspect in many "critical" domains, such as infectious disease tracking). TWDM is based on DataSpace, another NCDM project for supporting real-time streaming data; in DataSpace the Data Transformation Markup Language (DTML) is used to describe how to update "profiles", i.e. aggregate data which are inputs of predictive models, on the basis of new "events", i.e. new bits of information (an illustrative sketch of this event-driven profile update follows the list below).
• The Terabyte Challenge Testbed. The Terabyte Challenge Testbed is an open, distributed testbed for DataSpace tools, services, and protocols. It involves a number of organizations, including the University of Illinois at Chicago, the University of Pennsylvania, the University of California at Davis and Imperial College. The testbed consists of ten sites distributed over three continents connected by high-performance links. Each site provides a number of local clusters of workstations which are connected to form wide-area meta-clusters maintained by the National Scalable Cluster Project. So far, meta-clusters have been used by applications in high energy physics, computational chemistry, nonlinear simulation, bioinformatics, medical imaging, network traffic analysis, digital libraries of video data, etc. Currently, the Terabyte Challenge Testbed consists of approximately 100 nodes and 2 terabytes of disk storage.
• The Global Discovery Network (GDN). The GDN is a collaboration between the Laboratory for Advanced Computing of the National Center for Data Mining and the Discovery Net project (see above). It will link the Discovery Net to the Terra Wide Data Mining Testbed to create a combined global testbed with a critical mass of data.
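The TWDM entry above mentions DataSpace's DTML, which declares how profiles are updated from incoming events. DTML itself is an XML dialect that is not reproduced in this chapter; the Python sketch below only illustrates the underlying idea of incrementally folding events into an aggregate profile that feeds a predictive model. The event fields, the exponential smoothing rule and the smoothing factor are assumptions chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Event:
    """A new piece of information, e.g. one reported infection case."""
    region: str
    cases: int

class Profile:
    """Aggregate state that feeds a predictive model (illustrative only;
    the real DataSpace/DTML machinery is declarative XML, not Python)."""
    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha                        # smoothing factor (assumed)
        self.smoothed_cases = defaultdict(float)  # per-region running aggregate

    def update(self, event: Event) -> None:
        # Exponentially smoothed case count: the kind of incremental rule a
        # DTML document might declare to keep models up to date.
        prev = self.smoothed_cases[event.region]
        self.smoothed_cases[event.region] = (
            self.alpha * event.cases + (1 - self.alpha) * prev)

    def as_features(self, region: str) -> list:
        return [self.smoothed_cases[region]]

profile = Profile()
for ev in [Event("north", 12), Event("north", 7), Event("south", 3)]:
    profile.update(ev)

print(profile.as_features("north"))  # features passed on to a predictive model
```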
The GridMiner project at the University of Vienna aims to cover the main aspects of knowledge discovery on Grids. GridMiner is based on the OGSA framework (Foster et al., 2002) and embraces an open architecture in which a set of services is defined for handling data distribution and heterogeneity, supporting different types of analysis strategies, tools and algorithms, and providing OLAP support. Key components in GridMiner are the Data Access service, the Data Mediation service, and the Data Mining service. Data Access implements access to databases and data repositories; Data Mediation provides a view of distributed data by logically integrating them into virtual data sources (VDS), allowing queries to be sent to them and results to be combined and delivered back. The Data Mining layer comprises a set of specific services for preparing and executing a Data Mining application, as well as presenting its results. The system has not yet been implemented on a Grid; a preliminary, fully centralized version is currently available.

GATES (Grid-based AdapTive Execution on Streams) is an OGSA-based system that provides support for processing data streams in a Grid environment (Agrawal, 2003). GATES aims to support the distributed analysis of data streams arising from distributed sources (e.g., data from large-scale experiments/simulations), providing automatic resource discovery and an interface for enabling self-adaptation to meet real-time constraints.

Some of the systems discussed above support specific application domains, others support a more general class of problems. Moreover, some of these systems are mainly advanced interfaces for integrating, accessing, and processing large datasets, whereas others provide more specific functionalities to support typical knowledge discovery processes. In the next section we present a Grid-based environment, named Knowledge Grid, whose aim is to support general PDKD applications, providing an interface both to manage and access large remote data sets, and to execute high-performance data analysis on them.

53.4 The Knowledge Grid

The Knowledge Grid (Cannataro and Talia, 2003) is an environment providing knowledge discovery services for a wide range of high-performance distributed applications. Data sets and Data Mining and analysis tools used in such applications are increasingly becoming available as stand-alone packages and as remote services on the Internet. Examples include gene and DNA databases, network access and intrusion data, drug features and effects data repositories, astronomy data files, and data about web usage, content, and structure. Knowledge discovery procedures in all these applications typically require the creation and management of complex, dynamic, multi-step workflows. At each step, data from various sources can be moved, filtered, integrated and fed into a Data Mining tool. Based on the output results, the analyst chooses which other data sets and mining components should be integrated in the workflow, or how to iterate the process to obtain a knowledge model. Workflows are mapped on a Grid by assigning their nodes to Grid hosts and using interconnections for implementing communication among the workflow nodes.

The Knowledge Grid supports such activities by providing mechanisms and high-level services for searching resources; representing, creating, and managing knowledge discovery processes; and composing existing data services and data mining services in a structured manner, allowing designers to plan, store, document, verify, share and re-execute their workflows as well as manage their output results. The Knowledge Grid architecture is composed of a set of services divided into two layers: the Core K-Grid layer, which interfaces the basic and generic Grid middleware services, and the High-level K-Grid layer, which interfaces the user by offering a set of services for the design and execution of knowledge discovery applications. Both layers make use of repositories that provide information about resource metadata, execution plans, and the knowledge obtained as a result of knowledge discovery applications.

[Figure 53.1: Main steps of application composition and execution in the Knowledge Grid: component selection, application workflow composition, and application execution on the Grid.]

In the Knowledge Grid environment, discovery processes are represented as workflows that a user may compose using both concrete and abstract Grid resources. Knowledge discovery workflows are defined using a visual interface that shows resources (data, tools, and hosts) to the user and offers mechanisms for integrating them in a workflow. Information about single resources and workflows is stored using an XML-based notation that represents a workflow (called an execution plan in the Knowledge Grid terminology) as a data-flow graph of nodes, each representing either a Data Mining service or a data transfer service. The XML representation allows the workflows for discovery processes to be easily validated, shared, translated into executable scripts, and stored for future executions. Figure 53.1 shows the main steps of the composition and execution processes of a knowledge discovery application on the Knowledge Grid.
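Since execution plans are described as XML data-flow graphs whose nodes are Data Mining or data transfer services, a small sketch can show why this representation is easy to validate and translate. The element and attribute names below are invented for illustration and do not follow the real Knowledge Grid execution plan schema; the parsing uses only Python's standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified execution plan: element and attribute names are
# illustrative only, not the actual Knowledge Grid schema.
PLAN = """
<executionPlan name="meta-learning">
  <node id="transfer1" type="dataTransfer" source="NodeA:/data/TR1" target="Node1"/>
  <node id="mine1" type="dataMining" tool="LearnerL1" host="Node1" input="transfer1"/>
  <node id="transfer2" type="dataTransfer" source="Node1:/models/C1" target="NodeZ" input="mine1"/>
  <node id="combine" type="dataMining" tool="CombinerTesterCT" host="NodeZ" input="transfer2"/>
</executionPlan>
"""

def load_plan(xml_text: str):
    root = ET.fromstring(xml_text)
    nodes, edges = {}, []
    for n in root.findall("node"):
        nodes[n.get("id")] = n.attrib
        if n.get("input"):                      # data-flow dependency
            edges.append((n.get("input"), n.get("id")))
    # Basic validation: every dependency must refer to a declared node.
    for src, dst in edges:
        if src not in nodes:
            raise ValueError(f"node {dst} depends on unknown node {src}")
    return nodes, edges

nodes, edges = load_plan(PLAN)
print("services:", [k for k, v in nodes.items() if v["type"] == "dataMining"])
print("data flow:", edges)
```

A scheduler or script translator would walk the same node-and-edge structure recovered here, which is what makes the XML form convenient to validate, share, and execute.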
53.4.1 Knowledge Grid Components and Tools

Figure 53.2 shows the general structure of the Knowledge Grid system and its main components and interaction patterns.

[Figure 53.2: The Knowledge Grid general structure and components: the High-level K-Grid layer (DAS, TAAS, EPMS, RPS), the Core K-Grid layer (KDS, RAEMS), and the KMR, KEPR, and KBR repositories holding resource metadata, execution plan metadata, and model metadata.]

The High-level K-Grid layer includes services used to compose, validate, and execute a parallel and distributed knowledge discovery computation. Moreover, the layer offers services to store and analyze the discovered knowledge. The main services of the High-level K-Grid layer are:

• The Data Access Service (DAS) allows for the search, selection, transfer, transformation, and delivery of data to be mined.
• The Tools and Algorithms Access Service (TAAS) is responsible for searching, selecting and downloading Data Mining tools and algorithms.
• The Execution Plan Management Service (EPMS). An execution plan is represented by a graph describing interactions and data flows among data sources, extraction tools, Data Mining tools, and visualization tools. The Execution Plan Management Service allows for defining the structure of an application by building the corresponding graph and adding a set of constraints about resources. Generated execution plans are stored, through the RAEMS, in the Knowledge Execution Plan Repository (KEPR).
• The Results Presentation Service (RPS) offers facilities for presenting and visualizing the extracted knowledge models (e.g., association rules, clustering models, classifications).

The Core K-Grid layer includes two main services:

• The Knowledge Directory Service (KDS), which manages metadata describing Knowledge Grid resources. Such resources comprise hosts; repositories of data to be mined; tools and algorithms used to extract, analyze, and manipulate data; distributed knowledge discovery execution plans; and knowledge obtained as a result of the mining process. The metadata information is represented by XML documents stored in a Knowledge Metadata Repository (KMR).
• The Resource Allocation and Execution Management Service (RAEMS), which is used to find a suitable mapping between an "abstract" execution plan (formalized in XML) and available resources, with the goal of satisfying the constraints (computing power, storage, memory, database, network performance) imposed by the execution plan. After the execution plan is activated, this service manages and coordinates the application execution and the storing of knowledge results in the Knowledge Base Repository (KBR).
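To summarize the division of responsibilities between the two layers, here is a minimal sketch of the services as Python abstract interfaces. The method names and signatures are invented for illustration; the actual Knowledge Grid services are Grid services, not Python classes.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

# High-level K-Grid layer (illustrative interfaces; names are assumptions).
class DataAccessService(ABC):
    @abstractmethod
    def search(self, query: str) -> List[str]: ...
    @abstractmethod
    def transfer(self, dataset: str, target_host: str) -> str: ...

class ToolsAndAlgorithmsAccessService(ABC):
    @abstractmethod
    def find_tool(self, task: str) -> List[str]: ...
    @abstractmethod
    def download(self, tool: str, target_host: str) -> str: ...

class ExecutionPlanManagementService(ABC):
    @abstractmethod
    def build_plan(self, workflow_graph: Dict, constraints: Dict) -> str: ...

class ResultsPresentationService(ABC):
    @abstractmethod
    def visualize(self, model: object) -> None: ...

# Core K-Grid layer.
class KnowledgeDirectoryService(ABC):
    @abstractmethod
    def lookup(self, metadata_query: str) -> List[Dict]: ...    # searches the KMR

class ResourceAllocationAndExecutionService(ABC):
    @abstractmethod
    def allocate(self, abstract_plan: str) -> Dict: ...         # abstract plan -> concrete hosts
    @abstractmethod
    def execute(self, concrete_plan: Dict) -> object: ...       # results stored in the KBR
```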
An Application Scenario

We discuss here a simple meta-learning process over the Knowledge Grid, to show how the execution of a distributed Data Mining application can benefit from the Knowledge Grid services (Cannataro et al., 2002B). Meta-learning aims to generate a number of independent classifiers by applying learning programs to a collection of distributed data sets in parallel. The classifiers computed by the learning programs are then collected and combined to obtain a global classifier (Prodromidis et al., 2000). Figure 53.3 shows a distributed meta-learning scenario, in which a global classifier GC is obtained on NodeZ starting from the original data set DS stored on NodeA.

[Figure 53.3: A distributed meta-learning scenario: a partitioner P on NodeA splits the data set DS into training sets TR1, ..., TRn, a testing set TS and a validation set VS; learners L1, ..., Ln on Node1, ..., Noden produce classifiers C1, ..., Cn; a combiner/tester CT on NodeZ produces the global classifier GC.]

This process can be described through the following steps:

1. On NodeA, the training sets TR1, ..., TRn, the testing set TS and the validation set VS are extracted from DS by the partitioner P. Then TR1, ..., TRn, TS and VS are moved from NodeA to Node1, ..., Noden and to NodeZ, respectively.
2. On each Node_i (i = 1, ..., n) the classifier C_i is trained from TR_i by the learner L_i. Then each C_i is moved from Node_i to NodeZ.
3. On NodeZ, the classifiers C1, ..., Cn are combined and tested on TS and validated on VS by the combiner/tester CT to produce the global classifier GC.

To design such an application, a Knowledge Grid user interacts with the EPMS service, which provides a visual interface (see below) to compose a workflow (execution plan) describing at a high level the activities involved in the overall Data Mining computation. Through the execution plan, computing, software and data resources are specified along with a set of requirements on them. In our example the user requires a set of n nodes providing the Learner software and a node providing the Combiner/Tester software, all of them satisfying given platform constraints and performance requirements. In addition, the execution plan includes information about how to coordinate the execution of all the steps, as outlined above.

The execution plan is then processed by the RAEMS, which takes care of its allocation. In particular, it first finds appropriate resources matching the user requirements (i.e., a set of concrete hosts Node1, ..., Noden offering the software L, and a host NodeZ providing the CT software), using the KDS services. Next, it manages the execution of the overall application, enforcing dependencies among data extraction, transfer, and mining steps, as specified in the execution plan. The operations of data extraction and transfer are performed at a lower level by invoking the DAS services. We observe that, where needed, the RAEMS may perform software staging by means of the TAAS service. Finally, the RAEMS manages the retrieval of results (i.e., the transfer of the global classifier GC to the user host) and visualizes them using the RPS facilities.
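The three steps of the scenario can be mirrored on a single machine by a short stacking-style sketch. The following Python code assumes scikit-learn and NumPy are available, simulates the partitioner, learners and combiner/tester as local function calls (on the Knowledge Grid the data and models would instead move between NodeA, Node1..Noden and NodeZ through the DAS and RAEMS), and adopts one plausible reading of step 3: the combiner is fitted on the validation set VS and the resulting global classifier GC is tested on TS.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A synthetic stand-in for the data set DS stored on NodeA.
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)

# Step 1 (NodeA): the partitioner P extracts TR_1..TR_n, TS and VS from DS.
X_rest, X_ts, y_rest, y_ts = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_vs, y_tr, y_vs = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
n = 3
TR = list(zip(np.array_split(X_tr, n), np.array_split(y_tr, n)))

# Step 2 (Node_1..Node_n): each learner L_i trains classifier C_i on TR_i.
classifiers = [
    DecisionTreeClassifier(random_state=i).fit(X_i, y_i)
    for i, (X_i, y_i) in enumerate(TR)
]

# Step 3 (NodeZ): the combiner/tester CT stacks C_1..C_n into the global
# classifier GC, fitting the combiner on VS and testing the result on TS.
meta_vs = np.column_stack([c.predict(X_vs) for c in classifiers])
GC = LogisticRegression().fit(meta_vs, y_vs)

meta_ts = np.column_stack([c.predict(X_ts) for c in classifiers])
print("Global classifier accuracy on TS:", accuracy_score(y_ts, GC.predict(meta_ts)))
```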
Implementation

A software environment that implements the main components of the Knowledge Grid, comprising services and functionalities ranging from information and discovery services to visual design and execution facilities, is VEGA, the Visual Environment for Grid Applications (Cannataro et al., 2002A). The main goal of VEGA is to offer a set of visual functionalities that let users design applications starting from a view of the current Grid status (i.e., available nodes and resources), and compose the different stages constituting them inside a structured environment. The high-level features offered by VEGA are intended to provide the user with easy access to Grid facilities at a high level of abstraction, so that she can concentrate on the application design process.

To fulfill this aim, VEGA builds a visual environment based on the component framework concept, using and enhancing basic services offered by the Knowledge Grid and the Globus Toolkit. Key concepts in the VEGA approach to the design of a Grid application are the visual language used to describe, in a component-like manner and through a graphical representation, the jobs constituting an application, and the possibility to group these jobs into workspaces to form specific interdependent stages. A consistency checking module parses the model of the computation both while the design is in progress and prior to its execution, monitoring and driving user actions so as to obtain a correct and consistent graphical representation of the application. Together with the workspace concept, VEGA also provides the virtual resource abstraction; thanks to these entities it is possible to compose applications working on data processed or generated in previous phases even if the execution has not been performed yet. VEGA includes an execution service, which gives the user the possibility to execute the designed application, monitor its status, and visualize results.
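As a rough analogue of the kind of checks VEGA's consistency checking module performs on the visual model, the sketch below validates a job graph for dangling inputs and cycles. The representation of jobs as a Python dictionary and the specific checks are assumptions of this example, not VEGA's actual implementation.

```python
def check_workflow(jobs: dict) -> list:
    """Toy consistency check in the spirit of VEGA's checking module.
    `jobs` maps a job name to the list of jobs whose output it consumes."""
    problems = []
    # 1. Every dependency must refer to an existing job (no dangling inputs).
    for job, deps in jobs.items():
        for d in deps:
            if d not in jobs:
                problems.append(f"{job} consumes output of undefined job {d}")
    # 2. The data flow must be acyclic, otherwise no execution order exists.
    state = {}  # unvisited (absent), "gray" = on stack, "black" = done
    def visit(j):
        state[j] = "gray"
        for d in jobs.get(j, []):
            if state.get(d) == "gray":
                problems.append(f"cycle detected through {j} and {d}")
            elif state.get(d) is None and d in jobs:
                visit(d)
        state[j] = "black"
    for j in jobs:
        if state.get(j) is None:
            visit(j)
    return problems

workflow = {"transfer_TR1": [], "learner_L1": ["transfer_TR1"],
            "transfer_C1": ["learner_L1"], "combiner_CT": ["transfer_C1", "missing_job"]}
print(check_workflow(workflow))  # reports the undefined dependency
```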
53.5 Summary

Parallel and Grid-based Data Mining are key technologies for enhancing the performance of knowledge discovery processes on large amounts of data. Parallel Data Mining is a mature area that has produced algorithms and techniques now broadly integrated into Data Mining systems and suites. Today, parallel Data Mining systems and algorithms can be integrated as components of Grid-based systems to develop high-performance knowledge discovery applications.

This chapter introduced Data Mining techniques on parallel architectures, showing how large-scale Data Mining and knowledge discovery applications can achieve scalability by using the systems, tools and performance offered by parallel processing systems. Some experiences and results in parallelizing Data Mining algorithms according to different approaches have also been reported.

To perform Data Mining on massive data sets distributed across multiple sites, knowledge discovery systems based on Grid infrastructures are emerging. The chapter discussed the main benefits coming from the use of Grid models and platforms in developing distributed knowledge discovery systems, analyzing some emerging Grid-based Data Mining systems. Parallel and Grid-based Data Mining will play an increasingly important role in data analysis and knowledge extraction in several application contexts. The Knowledge Grid environment, briefly described here, is a representative effort to build a Grid-based parallel and distributed knowledge discovery system for a wide set of high-performance distributed applications.

References

Agrawal G. High-level Interfaces and Abstractions for Grid-based Data Mining. Workshop on Data Mining and Exploration Middleware for Distributed and Grid Computing; 2003 September 18-19; Minneapolis, MN.
Agrawal R., Shafer J.C. Parallel Mining of Association Rules. IEEE Transactions on Knowledge and Data Engineering 1996; 8:962-969.
Agrawal R., Srikant R. Fast Algorithms for Mining Association Rules. Proceedings of the 20th International Conference on Very Large Databases; 1994; Santiago, Chile.
Berman F. From TeraGrid to Knowledge Grid. Communications of the ACM 2001; 44(11):27-28.
Berry M.J.A., Linoff G. Data Mining Techniques for Marketing, Sales, and Customer Support. New York: Wiley Computer Publishing, 1997.
Beynon M., Kurc T., Catalyurek U., Chang C., Sussman A., Saltz J. Distributed Processing of Very Large Datasets with DataCutter. Parallel Computing 2001; 27(11):1457-1478.
Bigus J.P. Data Mining with Neural Networks. New York: McGraw-Hill, 1996.
Bruynooghe M. Parallel Implementation of Fast Clustering Algorithms. Proceedings of the International Symposium on High Performance Computing; 1989 March 22-24; Montpellier, France. Elsevier Science, 1989; 65-78.
Cannataro M., Congiusta A., Talia D., Trunfio P. A Data Mining Toolset for Distributed High-performance Platforms. Proceedings of the International Conference on Data Mining Methods and Databases for Engineering; 2002 September 25-27; Bologna, Italy. Wessex Institute Press, 2002; 41-50.
Cannataro M., Talia D. The Knowledge Grid. Communications of the ACM 2003; 46(1):89-93.
Cannataro M., Talia D., Trunfio P. KNOWLEDGE GRID: High Performance Knowledge Discovery Services on the Grid. Proceedings of the 2nd International Workshop GRID 2001; 2001 November; Denver, CO. Springer-Verlag, 2001; LNCS 2242:38-50.
Cannataro M., Talia D., Trunfio P. Distributed Data Mining on the Grid. Future Generation Computer Systems 2002; 18(8):1101-1112.
Congiusta A., Talia D., Trunfio P. VEGA: A Visual Environment for Developing Complex Grid Applications. Proceedings of the First International Workshop on Knowledge Grid and Grid Intelligence (KGGI); 2003 October 13; Halifax, Canada.
Catlett C. The TeraGrid: a Primer, 2002.
Curcin V., Ghanem M., Guo Y., Kohler M., Rowe A., Syed J., Wendel P. Discovery Net: Towards a Grid of Knowledge Discovery. Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining; 2002 July 23-26; Edmonton, Canada.
Foster I., Kesselman C., Nick J., Tuecke S. The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration, 2002.
Foti D., Lipari D., Pizzuti C., Talia D. Scalable Parallel Clustering for Data Mining on Multicomputers. Proceedings of the 3rd International Workshop on High Performance Data Mining; 2000; Cancun. Springer-Verlag, 2000; LNCS 1800:390-398.
Freitas A.A., Lavington S.H. Mining Very Large Databases with Parallel Processing. Boston: Kluwer Academic Publishers, 1998.
Giannadakis N., Rowe A., Ghanem M., Guo Y. InfoGrid: Providing Information Integration for Knowledge Discovery. Information Sciences 2003; 155:199-226.
Han E.H., Karypis G., Kumar V. Scalable Parallel Data Mining for Association Rules. IEEE Transactions on Knowledge and Data Engineering 2000; 12(2):337-352.
Hinke T., Novotny J. Data Mining on NASA's Information Power Grid. Proceedings of the 9th International Symposium on High Performance Distributed Computing; 2000 August 1-4; Pittsburgh, PA.
Johnston W.E. Computational and Data Grids in Large-Scale Science and Engineering. Future Generation Computer Systems 2002; 18(8):1085-1100.
Judd D., McKinley K., Jain A.K. Large-Scale Parallel Data Clustering. Proceedings of the International Conference on Pattern Recognition; 1996; Wien.
Kargupta H., Chan P. (Eds.). Advances in Distributed and Parallel Knowledge Discovery. Boston: AAAI/MIT Press, 2000.
Kufrin R. Generating C4.5 Production Rules in Parallel. Proceedings of the 14th National Conference on Artificial Intelligence; AAAI Press, 1997.
Li X., Fang Z. Parallel Clustering Algorithms. Parallel Computing 1989; 11:275-290.
Moore R.W. Knowledge-Based Grids: Two Use Cases. GGF-3 Meeting, 2001.
Neri F., Giordana A. A Parallel Genetic Algorithm for Concept Learning. Proceedings of the 6th International Conference on Genetic Algorithms; 1995 July 15-19; Pittsburgh, PA. Morgan Kaufmann, 1995; 436-443.
Olson C.F. Parallel Algorithms for Hierarchical Clustering. Parallel Computing 1995; 21:1313-1325.
Pearson R.A. A Coarse-grained Parallel Induction Heuristic. In Parallel Processing for Artificial Intelligence 2, H. Kitano, V. Kumar, C.B. Suttner, eds. Elsevier Science, 1994.
Prodromidis A.L., Chan P.K., Stolfo S.J. Meta-Learning in Distributed Data Mining Systems: Issues and Approaches. In Advances in Distributed and Parallel Knowledge Discovery, H. Kargupta, P. Chan, eds. AAAI Press, 2000.
Shafer J., Agrawal R., Mehta M. SPRINT: A Scalable Parallel Classifier for Data Mining. Proceedings of the 22nd International Conference on Very Large Databases; 1996; Bombay.
Skillicorn D. Strategies for Parallel Data Mining. IEEE Concurrency 1999; 7(4):26-35.
Skillicorn D., Talia D. Mining Large Data Sets on Grids: Issues and Prospects. Computing and Informatics 2002; 21:347-362.
Witten I.H., Frank E. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann, 2000.
Zaki M.J. Parallel and Distributed Association Mining: A Survey. IEEE Concurrency 1999; 7(4):14-25.

54 Collaborative Data Mining

Steve Moyle
Oxford University Computing Laboratory

Summary. Collaborative Data Mining is a setting in which the Data Mining effort is distributed to multiple collaborating agents, human or software. The objective of the collaborative Data Mining effort is to produce solutions to the tackled Data Mining problem that are considered better, by some metric, than the solutions that would have been achieved by individual, non-collaborating agents. The solutions require evaluation, comparison, and approaches for combination. Collaboration requires communication, and implies some form of community. The human form of collaboration is a social task. Organizing communities in an effective manner is non-trivial and often requires well-defined roles and processes. Data Mining, too, benefits from a standard process. This chapter explores the standard Data Mining process CRISP-DM utilized in a collaborative setting.

Key words: Collaborative Data Mining, CRISP-DM, ROC

54.1 Introduction

Data Mining is about solving problems using data (Witten and Frank, 2000), and as such it is normally a creative activity leveraging human intelligence. This is similar to the spirit and practices of scientific discovery (Bacon, 1994, Popper, 1977, Kuhn, 1970), which utilize many techniques including induction, abduction, hunches, and clever guessing to propose hypotheses that aid in understanding the problem and finally lead to a solution. Collaboration is the act of working together with one or more people in order to achieve something (Soukhanov, 2001). Collaboration in intelligence-intensive activities may lead to improved results. However, collaboration brings its own difficulties, including communication and coordination, as well as cultural and social difficulties. Some of these difficulties can be analyzed with the e-Collaboration Space model (McKenzie and van Winkelen, 2001).

Data Mining projects benefit from a rigorous process and methodology (Adriaans and Zantinge, 1996, Fayyad et al., 1996, Chapman et al., 2000). For collaborative Data Mining, such processes need to be embedded in a broader set of processes that support the collaboration.
