Data Mining and Statistics for Decision Making [Tufféry 2011-04-18]

Wiley Series in Computational Statistics

Stéphane Tufféry, University of Rennes, France
With Forewords by Gilbert Saporta and David J. Hand
Translated by Rod Riesco

Data mining is the process of automatically searching large volumes of data for models and patterns using computational techniques from statistics, machine learning and information theory; it is the ideal tool for such an extraction of knowledge.

Data mining is usually associated with a business or an organization's need to identify trends and profiles, allowing, for example, retailers to discover patterns on which to base marketing objectives.

This book looks at both classical and modern methods of data mining, such as clustering, discriminant analysis, decision trees, neural networks and support vector machines, with illustrative examples throughout the book to explain the theory of these models. Recent methods such as bagging and boosting, decision trees, neural networks, support vector machines and genetic algorithms are also discussed, along with their advantages and disadvantages.

Key Features:
  Presents a comprehensive introduction to all techniques used in data mining and statistical learning.
  Includes coverage of data mining with R as well as a thorough comparison of the two industry leaders, SAS and SPSS.
  Gives practical tips for data mining implementation as well as the latest techniques and state of the art theory.
  Looks at a range of methods, tools and applications, from scoring to web mining and text mining, and presents their advantages and disadvantages.
  Supported by an accompanying website hosting datasets and user analysis.

Business intelligence analysts and statisticians, compliance and financial experts in both commercial and government organizations across all industry sectors will benefit from this book.

www.wiley.com/go/decision_making

Consulting Editors:
  Paolo Giudici, University of Pavia, Italy
  Geof H. Givens, Colorado State University, USA
  Bani K. Mallick, Texas A&M University, USA

Wiley Series in Computational Statistics is comprised of practical guides and cutting edge research books on new developments in computational statistics. It features quality authors with a strong applications focus. The texts in the series provide detailed coverage of statistical concepts, methods and case studies in areas at the interface of statistics, computing, and numerics.

With sound motivation and a wealth of practical examples, the books show in concrete terms how to select and to use appropriate ranges of statistical computing techniques in particular fields of study. Readers are assumed to have a basic understanding of introductory terminology.

The series concentrates on applications of computational methods in statistics to fields of bioinformatics, genomics, epidemiology, business, engineering, finance and applied statistics.

Titles in the Series
  Biegler, Biros, Ghattas, Heinkenschloss, Keyes, Mallick, Marzouk, Tenorio, Waanders, Willcox – Large-Scale Inverse Problems and Quantification of Uncertainty
  Billard and Diday – Symbolic Data Analysis: Conceptual Statistics and Data Mining
  Bolstad – Understanding Computational Bayesian Statistics
  Borgelt, Steinbrecher and Kruse – Graphical Models, 2e
  Dunne – A Statistical Approach to Neural Networks for Pattern Recognition
  Liang, Liu and Carroll – Advanced Markov Chain Monte Carlo Methods
  Ntzoufras – Bayesian Modeling Using WinBUGS

First published under the title ‘Data Mining et Statistique Décisionnelle’ by Editions Technip
© Editions Technip 2008
All rights reserved.
Authorised translation from French language edition published by Editions Technip, 2008

This edition first published 2011
© 2011 John Wiley & Sons, Ltd

Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data
Tuffery, Stephane
Data mining and statistics for decision making / Stephane Tuffery.
p. cm. – (Wiley series in computational statistics)
Includes bibliographical references and index.
ISBN 978-0-470-68829-8 (hardback)
1. Data mining. 2. Statistical decision. I. Title.
QA76.9.D343T84 2011
006.3’12–dc22
2010039789

A catalogue record for this book is available from the British Library.

Print ISBN: 978-0-470-68829-8
ePDF ISBN: 978-0-470-97916-7
oBook ISBN: 978-0-470-97917-4
ePub ISBN: 978-0-470-97928-0

Typeset in 10/12pt Times Roman by Thomson Digital, Noida, India

to Paul and Nicole Tufféry, with gratitude and affection

Contents

Preface
Foreword
Foreword from the French language edition
List of trademarks

1 Overview of data mining
  1.1 What is data mining?
  1.2 What is data mining used for?
    1.2.1 Data mining in different sectors
    1.2.2 Data mining in different applications
  1.3 Data mining and statistics
  1.4 Data mining and information technology
  1.5 Data mining and protection of personal data
  1.6 Implementation of data mining

2 The development of a data mining study
  2.1 Defining the aims
  2.2 Listing the existing data
  2.3 Collecting the data
  2.4 Exploring and preparing the data
  2.5 Population segmentation
  2.6 Drawing up and validating predictive models
  2.7 Synthesizing predictive models of different segments
  2.8 Iteration of the preceding steps
  2.9 Deploying the models
  2.10 Training the model users
  2.11 Monitoring the models
  2.12 Enriching the models
  2.13 Remarks
  2.14 Life cycle of a model
  2.15 Costs of a pilot project

3 Data exploration and preparation
  3.1 The different types of data
  3.2 Examining the distribution of variables
  3.3 Detection of rare or missing values
  3.4 Detection of aberrant values
  3.5 Detection of extreme values
  3.6 Tests of normality
  3.7 Homoscedasticity and heteroscedasticity
  3.8 Detection of the most discriminating variables
    3.8.1 Qualitative, discrete or binned independent variables
    3.8.2 Continuous independent variables
    3.8.3 Details of single-factor non-parametric tests
    3.8.4 ODS and automated selection of discriminating variables
  3.9 Transformation of variables
  3.10 Choosing ranges of values of binned variables
  3.11 Creating new variables
  3.12 Detecting interactions
  3.13 Automatic variable selection
  3.14 Detection of collinearity
  3.15 Sampling
    3.15.1 Using sampling
    3.15.2 Random sampling methods

4 Using commercial data
  4.1 Data used in commercial applications
    4.1.1 Data on transactions and RFM data
    4.1.2 Data on products and contracts
    4.1.3 Lifetimes
    4.1.4 Data on channels
    4.1.5 Relational, attitudinal and psychographic data
    4.1.6 Sociodemographic data
    4.1.7 When data are unavailable
    4.1.8 Technical data
  4.2 Special data
    4.2.1 Geodemographic data
    4.2.2 Profitability
  4.3 Data used by business sector
    4.3.1 Data used in banking
    4.3.2 Data used in insurance
    4.3.3 Data used in telephony
    4.3.4 Data used in mail order

5 Statistical and data mining software
  5.1 Types of data mining and statistical software
  5.2 Essential characteristics of the software
    5.2.1 Points of comparison
    5.2.2 Methods implemented
    5.2.3 Data preparation functions
    5.2.4 Other functions
    5.2.5 Technical characteristics
  5.3 The main software packages
    5.3.1 Overview
  ...

Foreword

It is a real pleasure to be invited to write the foreword to the English translation of Stéphane Tufféry's book Data Mining and Statistics for Decision Making. Data mining... ever, this book covers all the essentials (and more) needed for a clear understanding and proper application of data mining and statistics for decision making. Among the new features in this edition, ...

Format
Pages	717
File size	11.43 MB