
IT Training Applied Data Mining: Statistical Methods for Business and Industry [Giudici 2003-10-17]




DOCUMENT INFORMATION

Basic information

Format
Pages: 378
Size: 4 MB

Content

Applied Data Mining
Statistical Methods for Business and Industry

PAOLO GIUDICI
Faculty of Economics
University of Pavia
Italy

Copyright © 2003 John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England

Telephone (+44) 1243 779777
Email (for orders and customer service enquiries): cs-books@wiley.co.uk
Visit our Home Page on www.wileyeurope.com or www.wiley.com

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP, UK, without the permission in writing of the Publisher. Requests to the Publisher should be addressed to the Permissions Department, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England, or emailed to permreq@wiley.co.uk, or faxed to (+44) 1243 770620.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Other Wiley Editorial Offices

John Wiley & Sons Inc., 111 River Street, Hoboken, NJ 07030, USA
Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA
Wiley-VCH Verlag GmbH, Boschstr. 12, D-69469 Weinheim, Germany
John Wiley & Sons Australia Ltd, 33 Park Road, Milton, Queensland 4064, Australia
John Wiley & Sons (Asia) Pte Ltd, Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809
John Wiley & Sons Canada Ltd, 22 Worcester Road, Etobicoke, Ontario, Canada M9W 1L1

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Library of Congress Cataloging-in-Publication Data

Giudici, Paolo.
Applied data mining : statistical methods for business and industry / Paolo Giudici.
p. cm.
Includes bibliographical references and index.
ISBN 0-470-84678-X (alk. paper) – ISBN 0-470-84679-8 (pbk.)
1. Data mining. 2. Business – Data processing. 3. Commercial statistics. I. Title.
QA76.9.D343G75 2003
2003050196

British Library Cataloguing in Publication Data

A catalogue record for this book is available from the British Library.

ISBN 0-470-84678-X (Cloth)
ISBN 0-470-84679-8 (Paper)

Typeset in 10/12pt Times by Laserwords Private Limited, Chennai, India.
Printed and bound in Great Britain by Biddles Ltd, Guildford, Surrey.
This book is printed on acid-free paper responsibly manufactured from sustainable forestry in which at least two trees are planted for each one used for paper production.

Contents

Preface

1 Introduction
  1.1 What is data mining?
    1.1.1 Data mining and computing
    1.1.2 Data mining and statistics
  1.2 The data mining process
  1.3 Software for data mining
  1.4 Organisation of the book
    1.4.1 Chapters 2 to 6: methodology
    1.4.2 Chapters 7 to 12: business cases
  1.5 Further reading

Part I Methodology

2 Organisation of the data
  2.1 From the data warehouse to the data marts
    2.1.1 The data warehouse
    2.1.2 The data webhouse
    2.1.3 Data marts
  2.2 Classification of the data
  2.3 The data matrix
    2.3.1 Binarisation of the data matrix
  2.4 Frequency distributions
    2.4.1 Univariate distributions
    2.4.2 Multivariate distributions
  2.5 Transformation of the data
  2.6 Other data structures
  2.7 Further reading

3 Exploratory data analysis
  3.1 Univariate exploratory analysis
    3.1.1 Measures of location
    3.1.2 Measures of variability
    3.1.3 Measures of heterogeneity
    3.1.4 Measures of concentration
    3.1.5 Measures of asymmetry
    3.1.6 Measures of kurtosis
  3.2 Bivariate exploratory analysis
  3.3 Multivariate exploratory analysis of quantitative data
  3.4 Multivariate exploratory analysis of qualitative data
    3.4.1 Independence and association
    3.4.2 Distance measures
    3.4.3 Dependency measures
    3.4.4 Model-based measures
  3.5 Reduction of dimensionality
    3.5.1 Interpretation of the principal components
    3.5.2 Application of the principal components
  3.6 Further reading

4 Computational data mining
  4.1 Measures of distance
    4.1.1 Euclidean distance
    4.1.2 Similarity measures
    4.1.3 Multidimensional scaling
  4.2 Cluster analysis
    4.2.1 Hierarchical methods
    4.2.2 Evaluation of hierarchical methods
    4.2.3 Non-hierarchical methods
  4.3 Linear regression
    4.3.1 Bivariate linear regression
    4.3.2 Properties of the residuals
    4.3.3 Goodness of fit
    4.3.4 Multiple linear regression
  4.4 Logistic regression
    4.4.1 Interpretation of logistic regression
    4.4.2 Discriminant analysis
  4.5 Tree models
    4.5.1 Division criteria
    4.5.2 Pruning
  4.6 Neural networks
    4.6.1 Architecture of a neural network
    4.6.2 The multilayer perceptron
    4.6.3 Kohonen networks
  4.7 Nearest-neighbour models
  4.8 Local models
    4.8.1 Association rules
    4.8.2 Retrieval by content
  4.9 Further reading

5 Statistical data mining
  5.1 Uncertainty measures and inference
    5.1.1 Probability
    5.1.2 Statistical models
    5.1.3 Statistical inference
  5.2 Non-parametric modelling
  5.3 The normal linear model
    5.3.1 Main inferential results
    5.3.2 Application
  5.4 Generalised linear models
    5.4.1 The exponential family
    5.4.2 Definition of generalised linear models
    5.4.3 The logistic regression model
    5.4.4 Application
  5.5 Log-linear models
    5.5.1 Construction of a log-linear model
    5.5.2 Interpretation of a log-linear model
    5.5.3 Graphical log-linear models
    5.5.4 Log-linear model comparison
    5.5.5 Application
  5.6 Graphical models
    5.6.1 Symmetric graphical models
    5.6.2 Recursive graphical models
    5.6.3 Graphical models versus neural networks
  5.7 Further reading

6 Evaluation of data mining methods
  6.1 Criteria based on statistical tests
    6.1.1 Distance between statistical models
    6.1.2 Discrepancy of a statistical model
    6.1.3 The Kullback–Leibler discrepancy
  6.2 Criteria based on scoring functions
  6.3 Bayesian criteria
  6.4 Computational criteria
  6.5 Criteria based on loss functions
  6.6 Further reading

Part II Business cases

7 Market basket analysis
  7.1 Objectives of the analysis
  7.2 Description of the data
  7.3 Exploratory data analysis
  7.4 Model building
    7.4.1 Log-linear models
    7.4.2 Association rules
  7.5 Model comparison
  7.6 Summary report

8 Web clickstream analysis
  8.1 Objectives of the analysis
  8.2 Description of the data
  8.3 Exploratory data analysis
  8.4 Model building
    8.4.1 Sequence rules
    8.4.2 Link analysis
    8.4.3 Probabilistic expert systems
    8.4.4 Markov chains
  8.5 Model comparison
  8.6 Summary report

9 Profiling website visitors
  9.1 Objectives of the analysis
  9.2 Description of the data
  9.3 Exploratory analysis
  9.4 Model building
    9.4.1 Cluster analysis
    9.4.2 Kohonen maps
  9.5 Model comparison
  9.6 Summary report

10 Customer relationship management
  10.1 Objectives of the analysis
  10.2 Description of the data
  10.3 Exploratory data analysis
  10.4 Model building
    10.4.1 Logistic regression models
    10.4.2 Radial basis function networks
    10.4.3 Classification tree models
    10.4.4 Nearest-neighbour models
  10.5 Model comparison
  10.6 Summary report

11 Credit scoring
  11.1 Objectives of the analysis
  11.2 Description of the data
  11.3 Exploratory data analysis
  11.4 Model building
    11.4.1 Logistic regression models
    11.4.2 Classification tree models
    11.4.3 Multilayer perceptron models
...

... increasing availability of data in the current information society has led to the need for valid tools for its modelling and analysis. Data mining and applied statistical methods are the appropriate...

... results with a smaller campaign effort. Data mining is different from data retrieval because it looks for relations and associations between phenomena that are not known beforehand. It also allows ...

Date posted: 05/11/2019, 13:07