Document information
Pages: 590
File size: 29.76 MB

Content
Introduction to Data Mining for the Life Sciences
Rob Sullivan
Cincinnati, OH, USA

ISBN 978-1-58829-942-0
e-ISBN 978-1-59745-290-8
DOI 10.1007/978-1-59745-290-8
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2011941596
© Springer Science+Business Media, LLC 2012

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Humana Press, c/o Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper
Humana Press is part of Springer Science+Business Media (www.springer.com)

To my wife, without whose support, encouragement, love, and caffeine, none of this would have been possible.

Preface

A search for the word “zettabyte” will return a page that predicts that we will enter the zettabyte age around 2015. To store this amount of data on DVDs would require over 215 billion disks. The search itself (on September 20, 2011) returned 775,000 results, many relevant, many irrelevant, and many duplicates. The challenge is to elicit knowledge from all this data.

Scientific endeavors are constantly generating more and more data. As new generations of instruments are created, one of the characteristics is typically more sensitive results. In turn, this typically means more data is generated. New techniques made available by new instrumentation, techniques, and understanding allow us to
consider approaches such as genome-wide association studies (GWAS) that were outside of our ability to consider just a few years ago. Again, the challenge is to elicit knowledge from all this data.

But as we continue to generate this ever-increasing amount of data, we would also like to know what relationships and patterns exist between the data. This, in essence, is the goal of data mining: find the patterns within the data. This is what this book is about. Is there some quantity X that is related to some other quantity Y that isn’t obvious to us? If so, what could those relationships tell us? Is there something novel, something new, that these patterns tell us? Can it advance our knowledge?

There is no obvious end in sight to the increasing generation of data. To the contrary, as tools, techniques, and instrumentation continue to become smaller, cheaper, and thus, more available, it is likely that the opposite will be the case and data will continue to be generated in ever-increasing volumes. It is for this reason that automated approaches to processing data, understanding data, and finding these patterns will be even more important.

This leads to the major challenge of a book like this: what to include and what to leave out. We felt it important to cover as much of the theory of data mining as possible, including statistical, analytical, visualization, and machine learning techniques. Much exciting work is being done under the umbrella of machine learning, and much of it is seeing fruition within the data mining discipline itself. To say that this covers a wide area – and a multitude of sins – is an understatement. To those readers who question why we included a particular topic, and to those who question why we omitted some other topic, we can only apologize. In writing a book aimed at introducing data mining techniques to the life sciences, describing a broad range of techniques is a necessity to allow the researcher to select the most appropriate tools for
his/her investigation. Many of the techniques we discuss are not necessarily in widespread use, but we believe they can be valuable to the researcher.

Many people over many years triggered our enthusiasm for developing this text, and it is thanks to them that we have created this book. Particular thanks go to Bruce Lucarelli and Viswanath Balasubramanian for their contributions and insights on various parts of the text.

Cincinnati, OH, USA
Rob Sullivan

Contents

1 Introduction
1.1 Context
1.2 New Scientific Techniques, New Data Challenges in Life Sciences
1.3 The Ethics of Data Mining
1.4 Data Mining: Problems and Issues
1.5 From Data to Information: The Process
1.5.1 CRISP-DM
1.6 What Can Be Mined?
1.7 Standards and Terminologies
1.8 Interestingness
1.9 How Good Is Good Enough?
1.10 The Datasets Used in This Book
1.11 The Rest of the Story
1.12 What Have We Missed in the Introduction?
References

2 Fundamental Concepts
2.1 Introduction
2.2 Data Mining Activities
2.3 Models
2.4 Input Representation
2.5 Missing Data
2.6 Does Data Expire?
2.7 Frequent Patterns
2.8 Bias
2.9 Generalization
2.10 Data Characterization and Discrimination
2.11 Association Analysis
2.12 Classification and Prediction
2.12.1 Relevance Analysis

... concepts and then move on to discussing a variety of techniques. The focus of this book is data mining, but data mining with application to the life sciences. The first question, therefore, is what...

... consider these and other issues concerned with building the data architecture to provide a robust platform for data-mining efforts. Within this book, we consider the objective of data mining to provide