Data Mining in Grid Computing Environments [Dubitzky 2008-12-22]

Data Mining Techniques in Grid Computing Environments

Editor
Werner Dubitzky
University of Ulster, UK

This edition first published 2008
© 2008 by John Wiley & Sons, Ltd

Wiley-Blackwell is an imprint of John Wiley & Sons, formed by the merger of Wiley’s global Scientific, Technical and Medical business with Blackwell Publishing.

Registered office: John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Other Editorial Offices:
9600 Garsington Road, Oxford, OX4 2DQ, UK
111 River Street, Hoboken, NJ 07030-5774, USA

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com/wiley-blackwell

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloguing-in-Publication Data
Dubitzky, Werner, 1958-
Data mining techniques in grid computing environments / Werner Dubitzky.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-51258-6 (cloth)
1. Data mining. 2. Computational grids (Computer systems) I. Title.
QA76.9.D343D83 2008
004’.36–dc22
2008031720

ISBN: 978-0-470-51258-6

A catalogue record for this book is available from the British Library.

Set in 10/12 pt Times by Thomson Digital, Noida, India
Printed in Singapore by Markono Pte
First printing 2008

Contents

Preface
List of Contributors

1 Data mining meets grid computing: Time to dance?
Alberto Sánchez, Jesús Montes, Werner Dubitzky, Julio J. Valdés, María S. Pérez and Pedro de Miguel
  1.1 Introduction
  1.2 Data mining
    1.2.1 Complex data mining problems
    1.2.2 Data mining challenges
  1.3 Grid computing
    1.3.1 Grid computing challenges
  1.4 Data mining grid – mining grid data
    1.4.1 Data mining grid: a grid facilitating large-scale data mining
    1.4.2 Mining grid data: analyzing grid systems with data mining techniques
  1.5 Conclusions
  1.6 Summary of Chapters in this Volume

2 Data analysis services in the knowledge grid
Eugenio Cesario, Antonio Congiusta, Domenico Talia and Paolo Trunfio
  2.1 Introduction
  2.2 Approach
  2.3 Knowledge Grid services
    2.3.1 The Knowledge Grid architecture
    2.3.2 Implementation
  2.4 Data analysis services
  2.5 Design of Knowledge Grid applications
    2.5.1 The VEGA visual language
    2.5.2 UML application modelling
    2.5.3 Applications and experiments
  2.6 Conclusions

3 GridMiner: An advanced support for e-science analytics
Peter Brezany, Ivan Janciak and A Min Tjoa
  3.1 Introduction
  3.2 Rationale behind the design and development of GridMiner
  3.3 Use Case
  3.4 Knowledge discovery process and its support by the GridMiner
    3.4.1 Phases of knowledge discovery
    3.4.2 Workflow management
    3.4.3 Data management
    3.4.4 Data mining services and OLAP
    3.4.5 Security
  3.5 Graphical user interface
  3.6 Future developments
    3.6.1 High-level data mining model
    3.6.2 Data mining query language
    3.6.3 Distributed mining of data streams
  3.7 Conclusions

4 ADaM services: Scientific data mining in the service-oriented architecture paradigm
Rahul Ramachandran, Sara Graves, John Rushing, Ken Keyzer, Manil Maskey, Hong Lin and Helen Conover
  4.1 Introduction
  4.2 ADaM system overview
  4.3 ADaM toolkit overview
  4.4 Mining in a service-oriented architecture
  4.5 Mining web services
    4.5.1 Implementation architecture
    4.5.2 Workflow example
    4.5.3 Implementation issues
  4.6 Mining grid services
    4.6.1 Architecture components
    4.6.2 Workflow example
  4.7 Summary

5 Mining for misconfigured machines in grid systems
Noam Palatin, Arie Leizarowitz, Assaf Schuster and Ran Wolff
  5.1 Introduction
  5.2 Preliminaries and related work
    5.2.1 System misconfiguration detection
    5.2.2 Outlier detection
  5.3 Acquiring, pre-processing and storing data
    5.3.1 Data sources and acquisition
    5.3.2 Pre-processing
    5.3.3 Data organization
  5.4 Data analysis
    5.4.1 General approach
    5.4.2 Notation
    5.4.3 Algorithm
    5.4.4 Correctness and termination
  5.5 The GMS
  5.6 Evaluation
    5.6.1 Qualitative results
    5.6.2 Quantitative results
    5.6.3 Interoperability
  5.7 Conclusions and future work

6 FAEHIM: Federated Analysis Environment for Heterogeneous Intelligent Mining
Ali Shaikh Ali and Omer F. Rana
  6.1 Introduction
  6.2 Requirements of a distributed knowledge discovery framework
    6.2.1 Category 1: knowledge discovery specific requirements
    6.2.2 Category 2: distributed framework specific requirements
  6.3 Workflow-based knowledge discovery
  6.4 Data mining toolkit
  6.5 Data mining service framework
  6.6 Distributed data mining services
  6.7 Data manipulation tools
  6.8 Availability
  6.9 Empirical experiments
    6.9.1 Evaluating the framework accuracy
    6.9.2 Evaluating the running time of the framework
  6.10 Conclusions

7 Scalable and privacy preserving distributed data analysis over a service-oriented platform
William K. Cheung
  7.1 Introduction
  7.2 A service-oriented solution
  7.3 Background
    7.3.1 Types of distributed data analysis
    7.3.2 A brief review of distributed data analysis
    7.3.3 Data mining services and data analysis management systems
  7.4 Model-based scalable, privacy preserving, distributed data analysis
    7.4.1 Hierarchical local data abstractions
    7.4.2 Learning global models from local abstractions
  7.5 Modelling distributed data mining and workflow processes
    7.5.1 DDM processes in BPEL4WS
    7.5.2 Implementation details
  7.6 Lessons learned
    7.6.1 Performance of running distributed data analysis on BPEL
    7.6.2 Issues specific to service-oriented distributed data analysis
    7.6.3 Compatibility of Web services development tools
  7.7 Further research directions
    7.7.1 Optimizing BPEL4WS process execution
    7.7.2 Improved support of data analysis process management
    7.7.3 Improved support of data privacy preservation
  7.8 Conclusions

8 Building and using analytical workflows in Discovery Net
Moustafa Ghanem, Vasa Curcin, Patrick Wendel and Yike Guo
  8.1 Introduction
    8.1.1 Workflows on the grid
  8.2 Discovery Net system
    8.2.1 System overview
    8.2.2 Workflow representation in DPML
    8.2.3 Multiple data models
    8.2.4 Workflow-based services
    8.2.5 Multiple execution models
    8.2.6 Data flow pull model
    8.2.7 Streaming and batch transfer of data elements
    8.2.8 Control flow push model
    8.2.9 Embedding
  8.3 Architecture for Discovery Net
    8.3.1 Motivation for a new server architecture
    8.3.2 Management of hosting environments
    8.3.3 Activity management
    8.3.4 Collaborative workflow platform
    8.3.5 Architecture overview
    8.3.6 Activity service definition layer
    8.3.7 Activity services bus
    8.3.8 Collaboration and execution services
    8.3.9 Workflow Services Bus
    8.3.10 Prototyping and production clients
  8.4 Data management
  8.5 Example of a workflow study
    8.5.1 ADR studies
    8.5.2 Analysis overview
    8.5.3 Service for transforming event data into patient annotations
    8.5.4 Service for defining exclusions
    8.5.5 Service for defining exposures
    8.5.6 Service for building the classification model
    8.5.7 Validation service
    8.5.8 Summary
  8.6 Future directions

9 Building workflows that traverse the bioinformatics data landscape
Robert Stevens, Paul Fisher, Jun Zhao, Carole Goble and Andy Brass
  9.1 Introduction
  9.2 The bioinformatics data landscape
  9.3 The bioinformatics experiment landscape
  9.4 Taverna for bioinformatics experiments
    9.4.1 Three-tiered enactment in Taverna
    9.4.2 The open-typing data models
  9.5 Building workflows in Taverna
    9.5.1 Designing a SCUFL workflow
  9.6 Workflow case study
    9.6.1 The bioinformatics task
    9.6.2 Current approaches and issues
    9.6.3 Constructing workflows
    9.6.4 Candidate genes involved in trypanosomiasis resistance
    9.6.5 Workflows and the systematic approach
  9.7 Discussion

10 Specification of distributed data mining workflows with DataMiningGrid
Dennis Wegener and Michael May
  10.1 Introduction
  10.2 DataMiningGrid environment
    10.2.1 General architecture
    10.2.2 Grid environment
    10.2.3 Scalability
    10.2.4 Workflow environment
  10.3 Operations for workflow construction
    10.3.1 Chaining
    10.3.2 Looping
    10.3.3 Branching
    10.3.4 Shipping algorithms
    10.3.5 Shipping data
    10.3.6 Parameter variation
    10.3.7 Parallelization
  10.4 Extensibility
  10.5 Case studies
    10.5.1 Evaluation criteria and experimental methodology
    10.5.2 Partitioning data
    10.5.3 Classifier comparison scenario
    10.5.4 Parameter optimization
  10.6 Discussion and related work
  10.7 Open issues
  10.8 Conclusions

...

... outlined above, the data mining process is in need of reformulation. This leads to the concept of distributed data mining, and in particular to grid-based data mining or – in analogy to a data grid ...

Format
Pages	289
File size	4.92 MB