11.6.7 Discriminant analysis on qualitative variables (DISQUAL Method)
11.6.8 Advantages of discriminant analysis
11.6.9 Disadvantages of discriminant analysis
11.7 Prediction by linear regression
11.7.1 Simple linear regression
11.7.2 Multiple linear regression and regularized regression
11.7.3 Tests in linear regression
11.7.4 Tests on residuals
11.7.5 The influence of observations
11.7.6 Example of linear regression
11.7.7 Further details of the SAS linear regression syntax
11.7.8 Problems of collinearity in linear regression: an example using R
11.7.9 Problems of collinearity in linear regression: diagnosis and solutions
11.7.10 PLS regression
11.7.11 Handling regularized regression with SAS and R
11.7.12 Robust regression
11.7.13 The general linear model
11.8 Classification by logistic regression
11.8.1 Principles of binary logistic regression
11.8.2 Logit, probit and log-log logistic regressions
11.8.3 Odds ratios
11.8.4 Illustration of division into categories
11.8.5 Estimating the parameters
11.8.6 Deviance and quality measurement in a model
11.8.7 Complete separation in logistic regression
11.8.8 Statistical tests in logistic regression
11.8.9 Effect of division into categories and choice of the reference category
11.8.10 Effect of collinearity
11.8.11 The effect of sampling on logit regression
11.8.12 The syntax of logistic regression in SAS software
11.8.13 An example of modelling by logistic regression
11.8.14 Logistic regression with R
11.8.15 Advantages of logistic regression
11.8.16 Advantages of the logit model compared with probit
11.8.17 Disadvantages of logistic regression
11.9 Developments in logistic regression
11.9.1 Logistic regression on individuals with different weights
11.9.2 Logistic regression with correlated data
11.9.3 Ordinal logistic regression
11.9.4 Multinomial logistic regression
11.9.5 PLS logistic regression
11.9.6 The generalized linear model
11.9.7 Poisson regression
11.9.8 The generalized additive model
11.10 Bayesian methods
11.10.1 The naive Bayesian classifier
11.10.2 Bayesian networks
11.11 Classification and prediction by neural networks
11.11.1 Advantages of neural networks
11.11.2 Disadvantages of neural networks
11.12 Classification by support vector machines
11.12.1 Introduction to SVMs
11.12.2 Example
11.12.3 Advantages of SVMs
11.12.4 Disadvantages of SVMs
11.13 Prediction by genetic algorithms
11.13.1 Random generation of initial rules
11.13.2 Selecting the best rules
11.13.3 Generating new rules
11.13.4 End of the algorithm
11.13.5 Applications of genetic algorithms
11.13.6 Disadvantages of genetic algorithms
11.14 Improving the performance of a predictive model
11.15 Bootstrapping and ensemble methods
11.15.1 Bootstrapping
11.15.2 Bagging
11.15.3 Boosting
11.15.4 Some applications
11.15.5 Conclusion
11.16 Using classification and prediction methods
11.16.1 Choosing the modelling methods
11.16.2 The training phase of a model
11.16.3 Reject inference
11.16.4 The test phase of a model
11.16.5 The ROC curve, the lift curve and the Gini index
11.16.6 The classification table of a model
11.16.7 The validation phase of a model
11.16.8 The application phase of a model
12 An application of data mining: scoring
12.1 The different types of score
12.2 Using propensity scores and risk scores
12.3 Methodology
12.3.1 Determining the objectives
12.3.2 Data inventory and preparation
12.3.3 Creating the analysis base
12.3.4 Developing a predictive model
12.3.5 Using the score
12.3.6 Deploying the score
12.3.7 Monitoring the available tools
12.4 Implementing a strategic score
12.5 Implementing an operational score
12.6 Scoring solutions used in a business
12.6.1 In-house or outsourced?
12.6.2 Generic or personalized score
12.6.3 Summary of the possible solutions
12.7 An example of credit scoring (data preparation)
12.8 An example of credit scoring (modelling by logistic regression)
12.9 An example of credit scoring (modelling by DISQUAL discriminant analysis)
12.10 A brief history of credit scoring
References
13 Factors for success in a data mining project
13.1 The subject
13.2 The people
13.3 The data
13.4 The IT systems
13.5 The business culture
13.6 Data mining: eight common misconceptions
13.6.1 No a priori knowledge is needed
13.6.2 No specialist staff are needed
13.6.3 No statisticians are needed (‘you can just press a button’)
13.6.4 Data mining will reveal unbelievable wonders
13.6.5 Data mining is revolutionary
13.6.6 You must use all the available data
13.6.7 You must always sample
13.6.8 You must never sample
13.7 Return on investment
14 Text mining
14.1 Definition of text mining
14.2 Text sources used
14.3 Using text mining
14.4 Information retrieval
14.4.1 Linguistic analysis
14.4.2 Application of statistics and data mining
14.4.3 Suitable methods
14.5 Information extraction
14.5.1 Principles of information extraction
14.5.2 Example of application: transcription of business interviews
14.6 Multi-type data mining
15 Web mining
15.1 The aims of web mining
15.2 Global analyses
15.2.1 What can they be used for?
15.2.2 The structure of the log file
15.2.3 Using the log file
15.3 Individual analyses
15.4 Personal analysis
Appendix A: Elements of statistics
A.1 A brief history
A.1.1 A few dates
A.1.2 From statistics . . . to data mining
A.2 Elements of statistics
A.2.1 Statistical characteristics
A.2.2 Box and whisker plot
A.2.3 Hypothesis testing
A.2.4 Asymptotic, exact, parametric and non-parametric tests
A.2.5 Confidence interval for a mean: Student’s t test
A.2.6 Confidence interval of a frequency (or proportion)
A.2.7 The relationship between two continuous variables: the linear correlation coefficient
A.2.8 The relationship between two numeric or ordinal variables: Spearman’s rank correlation coefficient and Kendall’s tau
A.2.9 The relationship between n sets of several continuous or binary variables: canonical correlation analysis
A.2.10 The relationship between two nominal variables: the χ² test
A.2.11 Example of use of the χ² test
A.2.12 The relationship between two nominal variables: Cramér’s coefficient
A.2.13 The relationship between a nominal variable and a numeric variable: the variance test (one-way ANOVA test)
A.2.14 The Cox semi-parametric survival model
A.3 Statistical tables
A.3.1 Table of the standard normal distribution
A.3.2 Table of Student’s t distribution
A.3.3 Chi-square table
A.3.4 Table of the Fisher-Snedecor distribution at the 0.05 significance level
A.3.5 Table of the Fisher-Snedecor distribution at the 0.10 significance level
Appendix B: Further reading
B.1 Statistics and data analysis
B.2 Data mining and statistical learning
B.3 Text mining
B.4 Web mining
B.5 R software
B.6 SAS software
B.7 IBM SPSS software
B.8 Websites
Index
Content
Data Mining and Statistics for Decision Making
Wiley Series in Computational Statistics
Stéphane Tufféry, University of Rennes, France
With Forewords by Gilbert Saporta and David J. Hand
Translated by Rod Riesco
www.wiley.com/go/decision_making

Data mining is the process of automatically searching large volumes of data for models and patterns using computational techniques from statistics, machine learning and information theory; it is the ideal tool for such an extraction of knowledge. Data mining is usually associated with a business or an organization’s need to identify trends and profiles, allowing, for example, retailers to discover patterns on which to base marketing objectives.

This book looks at both classical and modern methods of data mining, such as clustering, discriminant analysis, decision trees, neural networks and support vector machines, with illustrative examples throughout the book to explain the theory of these models. Recent methods such as bagging and boosting, decision trees, neural networks, support vector machines and genetic algorithms are also discussed, along with their advantages and disadvantages.

Key Features:
Presents a comprehensive introduction to all techniques used in data mining and statistical learning.
Includes coverage of data mining with R as well as a thorough comparison of the two industry leaders, SAS and SPSS.
Gives practical tips for data mining implementation as well as the latest techniques and state of the art theory.
Looks at a range of methods, tools and applications, from scoring to web mining and text mining, and presents their advantages and disadvantages.
Supported by an accompanying website hosting datasets and user analysis.

Business intelligence analysts and statisticians, compliance and financial experts in both commercial and government organizations across all industry sectors will benefit from this book.

Consulting Editors:
Paolo Giudici, University of Pavia, Italy
Geof H. Givens, Colorado State University, USA
Bani K. Mallick, Texas A&M University, USA

Wiley Series in Computational Statistics is comprised of practical guides and cutting edge research books on new developments in computational statistics. It features quality authors with a strong applications focus. The texts in the series provide detailed coverage of statistical concepts, methods and case studies in areas at the interface of statistics, computing, and numerics.

With sound motivation and a wealth of practical examples, the books show in concrete terms how to select and to use appropriate ranges of statistical computing techniques in particular fields of study. Readers are assumed to have a basic understanding of introductory terminology.

The series concentrates on applications of computational methods in statistics to fields of bioinformatics, genomics, epidemiology, business, engineering, finance and applied statistics.

Titles in the Series
Biegler, Biros, Ghattas, Heinkenschloss, Keyes, Mallick, Marzouk, Tenorio, Waanders, Willcox: Large-Scale Inverse Problems and Quantification of Uncertainty
Billard and Diday: Symbolic Data Analysis: Conceptual Statistics and Data Mining
Bolstad: Understanding Computational Bayesian Statistics
Borgelt, Steinbrecher and Kruse: Graphical Models, 2e
Dunne: A Statistical Approach to Neural Networks for Pattern Recognition
Liang, Liu and Carroll: Advanced Markov Chain Monte Carlo Methods
Ntzoufras: Bayesian Modeling Using WinBUGS

First published under the title ‘Data Mining et Statistique Decisionnelle’ by Editions Technip
© Editions Technip 2008
All rights reserved.
Authorised translation from French language edition published by Editions Technip, 2008

This edition first published 2011
© 2011 John Wiley & Sons, Ltd

Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data
Tufféry, Stéphane
Data mining and statistics for decision making / Stéphane Tufféry.
p. cm. (Wiley series in computational statistics)
Includes bibliographical references and index.
ISBN 978-0-470-68829-8 (hardback)
Data mining. Statistical decision. I. Title.
QA76.9.D343T84 2011
006.3’12–dc22
2010039789

A catalogue record for this book is available from the British Library.

Print ISBN: 978-0-470-68829-8
ePDF ISBN: 978-0-470-97916-7
oBook ISBN: 978-0-470-97917-4
ePub ISBN: 978-0-470-97928-0

Typeset in 10/12pt Times Roman by Thomson Digital, Noida, India

to Paul and Nicole Tufféry, with gratitude and affection

Contents
Preface
Foreword
Foreword from the French language edition
List of trademarks
1 Overview of data mining
1.1 What is data mining?
1.2 What is data mining used for?
1.2.1 Data mining in different sectors
1.2.2 Data mining in different applications
1.3 Data mining and statistics
1.4 Data mining and information technology
1.5 Data mining and protection of personal data
1.6 Implementation of data mining
2 The development of a data mining study
2.1 Defining the aims
2.2 Listing the existing data
2.3 Collecting the data
2.4 Exploring and preparing the data
2.5 Population segmentation
2.6 Drawing up and validating predictive models
2.7 Synthesizing predictive models of different segments
2.8 Iteration of the preceding steps
2.9 Deploying the models
2.10 Training the model users
2.11 Monitoring the models
2.12 Enriching the models
2.13 Remarks
2.14 Life cycle of a model
2.15 Costs of a pilot project
3 Data exploration and preparation
3.1 The different types of data
3.2 Examining the distribution of variables
3.3 Detection of rare or missing values
3.4 Detection of aberrant values
3.5 Detection of extreme values
3.6 Tests of normality
3.7 Homoscedasticity and heteroscedasticity
3.8 Detection of the most discriminating variables
3.8.1 Qualitative, discrete or binned independent variables
3.8.2 Continuous independent variables
3.8.3 Details of single-factor non-parametric tests
3.8.4 ODS and automated selection of discriminating variables
3.9 Transformation of variables
3.10 Choosing ranges of values of binned variables
3.11 Creating new variables
3.12 Detecting interactions
3.13 Automatic variable selection
3.14 Detection of collinearity
3.15 Sampling
3.15.1 Using sampling
3.15.2 Random sampling methods
4 Using commercial data
4.1 Data used in commercial applications
4.1.1 Data on transactions and RFM data
4.1.2 Data on products and contracts
4.1.3 Lifetimes
4.1.4 Data on channels
4.1.5 Relational, attitudinal and psychographic data
4.1.6 Sociodemographic data
4.1.7 When data are unavailable
4.1.8 Technical data
4.2 Special data
4.2.1 Geodemographic data
4.2.2 Profitability
4.3 Data used by business sector
4.3.1 Data used in banking
4.3.2 Data used in insurance
4.3.3 Data used in telephony
4.3.4 Data used in mail order
5 Statistical and data mining software
5.1 Types of data mining and statistical software
5.2 Essential characteristics of the software
5.2.1 Points of comparison
5.2.2 Methods implemented
5.2.3 Data preparation functions
5.2.4 Other functions
5.2.5 Technical characteristics
5.3 The main software packages
5.3.1 Overview
...

Foreword
It is a real pleasure to be invited to write the foreword to the English translation of Stéphane Tufféry’s book Data Mining and Statistics for Decision Making. Data mining ... ever, this book covers all the essentials (and more) needed for a clear understanding and proper application of data mining and statistics for decision making. Among the new features in this edition,