
Data Mining Techniques For Marketing, Sales, and Customer Relationship Management
Second Edition
Michael J.A. Berry
Gordon S. Linoff

Vice President and Executive Group Publisher: Richard Swadley
Vice President and Executive Publisher: Bob Ipsen
Vice President and Publisher: Joseph B. Wikert
Executive Editorial Director: Mary Bednarek
Executive Editor: Robert M. Elliott
Editorial Manager: Kathryn A. Malm
Senior Production Editor: Fred Bernardi
Development Editor: Emilie Herman, Erica Weinstein
Production Editor: Felicia Robinson
Media Development Specialist: Laura Carpenter VanWinkle
Text Design & Composition: Wiley Composition Services

Copyright © 2004 by Wiley Publishing, Inc., Indianapolis, Indiana. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8700. Requests to the Publisher for permission should be addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN 46256, (317) 572-3447, fax (317) 572-4447, E-mail: permcoordinator@wiley.com.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with
respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Trademarks: Wiley, the Wiley Publishing logo, are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Library of Congress Cataloging-in-Publication Data:
Berry, Michael J. A.
Data mining techniques : for marketing, sales, and customer relationship management / Michael J.A. Berry, Gordon Linoff.— 2nd ed.
p. cm.
Includes index.
ISBN 0-471-47064-3 (paper/website)
Data mining. Marketing—Data processing. Business—Data processing. I. Linoff, Gordon. II. Title.
HF5415.125 B47 2004
658.8’02—dc22
2003026693

ISBN: 0-471-47064-3
Printed in the United States of America
Published simultaneously in Canada

To Stephanie, Sasha, and Nathaniel. Without your patience and understanding, this book would not have been possible.
— Michael

To Puccio. Grazie per essere paziente me. Ti amo.
— Gordon

Acknowledgments

We are fortunate to be surrounded by some of the most talented data miners anywhere, so our first thanks go to our colleagues at Data Miners, Inc., from whom we have learned so much: Will Potts, Dorian Pyle, and Brij Masand. There are also clients with whom we work so closely that we consider them our colleagues as well: Harrison Sohmer and Stuart E. Ward, III are in that category. Our Editor, Bob Elliott, Editorial Assistant, Erica Weinstein, and Development Editor, Emilie Herman, kept us (more or less) on schedule and helped us maintain a consistent style. Lauren McCann, a graduate student at M.I.T. and intern at Data Miners, prepared the census data used in some examples and created some of the illustrations.

We would also like to acknowledge all of the people we have worked with in scores of data mining engagements over the years. We have learned something from every one of them. The many whose data mining projects have influenced the second edition of this book include: Al Fan, Herb Edelstein, Nick Gagliardo, Alan Parker, Jill Holtz, Nick Radcliffe, Anne Milley, Joan Forrester, Patrick Surry, Brian Guscott, John Wallace, Ronny Kohavi, Bruce Rylander, Josh Goff, Sheridan Young, Corina Cortes, Karen Kennedy, Susan Hunt Stevens, Daryl Berry, Kurt Thearling, Ted Browne, Daryl Pregibon, Lynne Brennen, Terri Kowalchuk, Doug Newell, Mark Smith, Victor Lo, Ed Freeman, Mateus Kehder, Yasmin Namini, Erin McCarthy, Michael Patrick, Zai Ying Huang.

And, of course, all the people we thanked in the first edition are still deserving of acknowledgement: Bob Flynn, Jim Flynn, Paul Berry, Bryan McNeely, Kamran Parsaye, Rakesh Agrawal, Claire Budden, Karen Stewart, Ric Amari, David Isaac, Larry Bookman, Rich Cohen, David Waltz, Larry Scroggins, Robert Groth, Dena d’Ebin, Lars Rohrberg, Robert Utzschnieder, Diana Lin, Lounette Dyer, Roland Pesch, Don Peppers, Marc Goodman,
Stephen Smith, Ed Horton, Marc Reifeis, Sue Osterfelt, Edward Ewen, Marge Sherold, Susan Buchanan, Fred Chapman, Mario Bourgoin, Syamala Srinivasan, Gary Drescher, Prof. Michael Jordan, Wei-Xing Ho, Gregory Lampshire, Patsy Campbell, William Petefish, Janet Smith, Paul Becker, Yvonne McCollin, Jerry Modes.

Index

MBR (memory-based reasoning), 262–263 neural networks, 219 prediction tasks, 10 hobbies, household-level data, 96 holdout groups, marketing campaigns, 106 home-based businesses, 56 household-level data, 96 hubs, link analysis, 332–334 hyperbolic tangent function, 223 hypothesis testing confidence levels, 148 considerations, 51 decision-making process, 50–51 generating, 51 market basket analysis, 51 null hypothesis, statistics and, 125–126

I
IBM, relational database management software, 13 ID and key variables, 554 ID3 (Iterative Dichotomiser 3), 190 identification columns, 548 customer signatures, 560–562 good prospects, 88–89 problem management, 43 proof-of-concept projects, 599–601 identified versus anonymous transactions, association rules, 308 identity distance, distance function, 271 ignored columns, 547 images, binary data, 557 imperfections, in data, 34 implementation neural networks, 212 proof-of-concept projects, 601–605 implicit parallelism, 438 in-between relationships, customer relationships, 453 income, household-level data, 96 inconclusive survey responses, 46 inconsistent data, 593–594 index-based scores, 92–95 indicator variables, 554 indirect relationships, customer relationships, 453–454 industry revolution, 18 inexplicable rules, association rules, 297–298 information competitive advantages, 14 data as, 22 infomediaries, 14 information brokers, supermarket chains as, 15–16 information gain, entropy, 178–180 information technology, data transformation, 58–60 as products, 14 recommendation-based businesses, 16–17 Inmon, Bill (Building the Data Warehouse), 474 input columns, 547 input layer, feed-forward neural networks, 226 input
variables, target fields, 37 inputs/outputs, neural networks, 215 insourcing data mining, 524–525 insurance claims, classification, interactive systems, response times, 33 Internet resources customer response to marketing campaigns, tracking, 109 RuleQuest, 190 U.S. Census Bureau, 94 interval variables, 549, 552 interviews business opportunities, identifying, 27 proof-of-concept projects, 600 intrinsic information, splits, decision trees, 180 introduction, of products, 27 intuition, data exploration, 65 involuntary churn, 118–119, 521 item popularity, market basket analysis, 293 item sets, market basket analysis, 289 Iterative Dichotomiser (ID3), 190

K
key and ID variables, 554 KDD (knowledge discovery in databases), Kimball, Ralph (The Data Warehouse Toolkit), 474 Kleinberg algorithm, link analysis, 332–333 K-means clustering, 354–358 knowledge discovery in databases (KDD), Kolmogorov-Smirnov (KS) tests, 101

L
large-business relationships, customer relationship management, 3–4 leaf nodes, classification, 167 learning opportunities, customer interactions, 520–521 supervised, 57 training techniques as, 231 truthful sources, 48–50 unsupervised, 57 untruthful sources, 44–48 life stages, customer relationships, 455–456 lifetime customer value, customer relationships, 32 lift ratio comparing models using, 81–82 lift charts, 82, 84 problems with, 83 linear processes, 55 linear regression, 139 link analysis authorities, 333–334 candidates, 333 case study, 343–346 classification, discussed, 321 fax machines, 337–341 graphs acyclic graphs, 331 communities of interest, 346 cyclic, 330–331 data as, 340 directed graphs, 330 edges, 322 graph-coloring algorithm, 340–341 Hamiltonian path, 328 nodes, 322 planar graphs, 323 traveling salesman problem, 327–329 vertices, 322 hubs, 332–334 Kleinberg algorithm, 332–333 root sets, 333 search programs, 331 stemming, 333 weighted graphs, 322, 324 linkage graphs,
77 lists, ordered and unordered, 239 literature, market research, 22 logarithms, data transformation, 74 logical schema, OLAP, 478 logistic methods, box diagrams, 200 long form, census data, 94 long-term trends, 75 lookup tables, auxiliary information, 570–571 loyalty customers, 520 loyalty programs marketing campaigns, 111 welcome periods, 518 luminosity, 351

M
mailings marketing campaigns, 97 non-response models, 35 marginal customers, 553 market basket analysis differentiation, 289 discussed, 287 geographic attributes, 293 item popularity, 293 item sets, 289 market basket data, 51, 289–291 marketing interventions, tracking, 293–294 order characteristics, 292 products, clustering by usage, 294–295 purchases, 289 support, 301 telecommunications customers, 288 time attributes, 293 market research control group response versus, 38 literature, 22 shortcomings, 25 survey-based, 113 marketing campaigns. See also advertising acquisitions-time data, 108–110 canonical measurements, 31 champion-challenger approach, 139 credit risks, reducing exposure to, 113–114 cross-selling, 115–116 customer response, tracking, 109 customer segmentation, 111–113 differential response analysis, 107–108 discussed, 95 fixed budgets, 97–100 loyalty programs, 111 new customer information, gathering, 109–110 people most influenced by, 106–107 planning, 27 profitability, 100–104 proof-of-concept projects, 600 response modeling, 96–97 as statistical analysis acuity of testing, 147–148 confidence intervals, 146 proportion, standard error of, 139–141 results, comparing, using confidence bounds, 141–143 sample sizes, 145 targeted acquisition campaigns, 31 types of, 111 up-selling, 115–116 usage stimulation, 111 marriages categorical values, 239–240 household-level data, 96 mass intimacy, customer relationships, 451–453 massively parallel processor (MPP), 485 maximum values, of simple functions, genetic algorithms, 424 MBR. See memory-based reasoning MDL (minimum
description length), 78 mean time between failures (MTBF), 384 mean time to failure (MTTF), 384 mean values, statistics, 137 measurement errors, 159 median customer lifetime value, retention, 387 median values, statistics, 137 medical insurance claims, useful data sources, 60 medical treatment applications, MBR, 258 meetings, brainstorming, 37 memory-based reasoning (MBR) case study, 259–262 challenges of, 262–265 classification codes, 266, 273–274 combination function, 258, 265 customer classification, 90–91 customer response prediction, 258 democracy approach, 279–281 distance function, 258, 265, 271–272 fraud detection, 258 free text response, 258 historical records, selecting, 262–263 medical treatment applications, 258 new customers, 277 relevance feedback, 267–268 similarity measurements, 271–272 training data, 263–264 weighted voting, 281–282 men, differential response analysis and, 107 messages, prospecting, 89–90 metadata repository, 484, 491 methodologies data correction, 72–74 data exploration, 64–68 data mining process, 54–55 data selection, 60–64 data transformation, 74–76 data translation, 56–60 learning sources truthful, 48–50 untruthful, 44–48 model assessment, 78–82 model building, 77 model deployment, 84–85 model sets, creating, 68–72 reasons for, 44 results, assessing, 85 metropolitan statistical area (MSA), 94 minimum description length (MDL), 78 minimum support pruning, decision trees, 312 minutes of use (MOU), wireless communications industries, 38 misclassification rates, binary classification, 98 missing data data correction, 73–74 NULL values, 590 splits, decision trees, 174–175 mission-critical applications, 32 mode values, statistics, 137 models assessing classifiers and predictors, 79 descriptive models, 78 directed models, 78–79 estimators, 79–81 building, 8, 77 comparing, using lift ratio, 81–82 deploying, 84–85 model sets balanced datasets, 68 components of, 52 customer signatures, assembling, 68 partitioning, 71–72 predictive models, 70–71 timelines, multiple, 70 non-response, mass mailings, 35 score sets, 52 motor vehicle registration records, useful data sources, 61 MOU (minutes of use), wireless communications industries, 38 MPP (massively parallel processor), 485 MSA (metropolitan statistical area), 94 MTBF (mean time between failures), 384 MTTF (mean time to failure), 384 multiway splits, decision trees, 171 mutation, genetic algorithms, 431–432

N
N variables, dimension, 352 National Consumer Assets Group (NCAG), 23 natural association, automatic cluster detection, 358 nearest neighbor techniques classification, collaborative filtering estimated ratings, 284–285 grouping customers, 90 predictions, 284–285 profiles, building and comparing, 283–284 social information filtering, 282 word-of-mouth advertising, 283 memory-based reasoning (MBR) case study, 259–262 challenges of, 262–265 classification codes, 266, 273–274 combination function, 258, 265 customer classification, 90–91 customer response prediction, 258 democracy approach, 279–281 distance function fraud detection, 258 free text responses, 258 historical records, selecting, 262–263 medical treatment applications, 258 new customers, 277 relevance feedback, 267–268 similarity measurements, 271–272 training data, 263–264 weighted voting, 281–282 negative correlation, 139 neighborliness parameters, neural networks, 250 neural networks activation function, 222 AND value, 222 automation, 213 average member technique, 252 bias sampling, 227 biological, 211 building models, case study, 252–254 categorical variables, 239–240 classification, combination function, 222 components of, 220–221 continuous values, features with, 235–237 coverage of values, 232–233 data preparation categorical values, 239–240 continuous values, 235–237 decision trees, 199 discussed, 211
estimation tasks, 10, 215 feed-forward back propagation, 228–232 hidden layer, 227 input layer, 226 output layer, 227 genetic algorithms and, 439–440 hidden layers, 221, 227 historical data, 219 history of, 212–213 implementation, 212 inputs/outputs, 215 neighborliness parameters, 250 nonlinear behaviors, 222 OR value, 222 overfitting, 234 parallel coordinates, 253 prediction, 215 real estate appraisal example, 213–217 results, interpreting, 241–243 sensitivity analysis, 247–248 sigmoid activation functions, 225 SOM (self-organizing map), 249–251 time series analysis, 244–247 training sets, selection consideration, 232–234 transfer function, 223 validation sets, 218 variable selection problem, 233 variance, 199 new customer information gathering, 109–110 memory-based reasoning, 277 profiles, building, 283 new start forecast (NSF), 469 nodes, graphs, 322 nonlinear behaviors, neural networks, 222 non-response models, mass mailings, 35 normal distribution, statistics, 130–132 normalization, numeric variables, 550 normalized absolute value, distance function, 275 NORMDIST function, 134 NORMSINV function, 147 NSF (new start forecast), 469 null hypothesis, statistics and, 125–126 NULL values, missing data, 590 numeric variables data correction, 73 distance function, 275 measure of, 550–551 splits, decision trees, 173

O
Occam’s Razor, 124–125 ODBC (Open Database Connectivity), 496 one-tailed distribution, 134 Online Analytic Processing (OLAP) additive facts, 501 data mining and, 507–508 decision-support summary data, 477–478 dimension tables, 502–503 discussed, 31 levels of, 475 logical schema, 478 metadata, 483–484, 491 operational summary data, 477 physical schema, 478 reporting requirements, 495–496 transaction data, 476–477 Open Database Connectivity (ODBC), 496 operational errors, 159 operational feedback, 485, 492 operational summary data, OLAP, 477 opportunistic sample, defined, 25
opportunities, good response scores, 34 optimization genetic algorithms, 422 resources, genetic algorithms, 433–435 training as, 230 OR value, neural networks, 222 Oracle, relational database management software, 13 order characteristics, market basket analysis, 292 ordered lists, 239 ordered variables, measure of, 549 organizations. See businesses out of time tests, 72 outliers data correction, 73 data transformation, 74 output layer, feed-forward neural networks, 227 outputs, neural networks, 215 outsourcing data mining, 522–524 overfitting, neural networks, 234

P
parallel coordinates, neural networks, 253 parsing variables, 569 patterns meaningful discoveries, 56 prediction, 45 untruthful learning sources, 45–46 peg values, 236 penetration, proportion, 203 percent variations, 105 perceptrons, defined, 212 performance, classification, 12 physical schema, OLAP, 478 pilot projects, 598 planar graphs, 323 planned processes, proof-of-concept projects, 599 platforms, data mining, 527 point of maximum benefit, 101 point-of-sale data association rules, 288 scanners, as useful data source, 60 population diversity, 178 positive ratings, voting, 284 postcards, as communication channel, 89 potential revenue, behavior-based variables, 583–585 precision measurements, classification codes, 273–274 preclassified tests, 79 predictions accuracy, 79 association rules, 70 business goals, formulating, 605 collaborative filtering, 284–285 credit risks, 113–114 customer longevity, 119–120 data transformation, 57 defined, 52 directed data mining, 57 errors, 191 future behaviors, 10 historical data, 10 model sets for, 70–71 neural networks, 215 patterns, 45 prediction task examples, 10 profiling versus, 52–53 response, MBR, 258 uses for, 54 probabilities calculating, 309 class labels, 85 distribution and, 135 hazards, 394–396 statistics, 133–135 probation periods, 518 problem management data transformation, 56–57 identification, 43 lift ratio, 83
profiling as, 53–54 rule-oriented problems, 176 variable selection problems, neural networks, 233 products clustering by usage, market basket analysis, 294–295 co-occurrence of, 299 hierarchical categories, 305 information as, 14 introduction, planning for, 27 product codes, as categorical value, 239 product-focused businesses, taxonomy, 305 profiling business goals, formulating, 605 collaborative filtering, 283–284 data transformation, 57 decision trees, 12 demographic profiles, 31 descriptive, 52 directed, 52 examples of, 54 gender example, 12 new customer information, 283 overview, 12 prediction versus, 52–53 as problem management, 53–54 survey response, 53 profitability marketing campaigns, 100–104 proof-of-concept projects, 599 results, assessing, 85 projective visualization (Marc Goodman), 206–208 proof-of-concept projects expectations, 599 identifying, 599–601 implementation, 601–605 propensity categorical variables, 242 propensity-to-respond score, 97 proportion converting counts to, 75–76 difference of proportion chi-square tests versus, 153–154 statistical analysis, 143–144 penetration, 203 standard error of, statistical analysis, 139–141 proportional hazards Cox, 410–411 discussed, 408 examples of, 409 limitations of, 411–412 proportional scoring, census data, 94–95 prospecting advertising techniques, 90–94 communication channels, 89 customer relationships, 457 efforts, 90 good prospects, identifying, 88–89 index-based scores, 92–95 marketing campaigns acquisition-time variables, 110 credit risks, reducing exposure to, 113–114 cross-selling, 115–116 customer response, tracking, 109 customer segmentation, 111–113 differential response analysis, 107–108 discussed, 95 fixed budgets, 97–100 new customer information, gathering, 109–110 people most influenced by, 106–107 planning, 27 profitability, 100–104 response modeling, 96–97 types of, 111 up-selling, 115–116 messages,
selecting appropriate, 89–90 ranking, 88–89 roles in, 88 targeting, 88 time dependency and, 160 prospective customer value, 115 prototypes, proof-of-concept projects, 599 pruning, decision trees C5 algorithm, 190–191 CART algorithm, 185, 188–189 discussed, 184 minimum support pruning, 312 stability-based, 191–192 public records, household-level data, 96 publications Building the Data Warehouse (Bill Inmon), 474 Business Modeling and Data Mining (Dorian Pyle), 60 Data Preparation for Data Mining (Dorian Pyle), 75 The Data Warehouse Toolkit (Ralph Kimball), 474 Genetic Algorithms in Search, Optimization, and Machine Learning (Goldberg), 445 purchases, market basket analysis, 289 purchasing frequencies, behavior-based variables, 575–576 purity measures, splitting criteria, decision trees, 177–178 p-values, statistics, 126 Pyle, Dorian Business Modeling and Data Mining, 60 Data Preparation for Data Mining, 75

Q
quadratic discriminants, box diagrams, 200 quality of data, association rules, 308 question asking, data exploration, 67–68 Quinlan, J. Ross (Iterative Dichotomiser 3), 190 q-values, statistics, 126

R
range values, statistics, 137 rate plans, wireless telephone services, ratios data transformation, 75 lift ratio, 81–84 RDBMS. See relational database management system real estate appraisals, neural network example, 213–217 recall measurements, classification codes, 273–274 recency, frequency, and monetary (RFM) value, 575 recommendation-based businesses, 16–17 records combining values within, 569 default classes, 194 transactional, 574 rectangular regions, decision trees, 197 recursive algorithms, 173 reduction in variance, splits, decision trees, 183 regression building models, estimation tasks, 10 linear, 139 regression trees, 170 statistics, 139 techniques, genetic algorithms, 423 relational database management system (RDBMS) discussed, 474 source systems, 594–595 star schema, 505 suppliers, 13 support, 511 relevance feedback,
MBR, 267–268 replicating results, 33 reporting requirements, OLAP, 495–496 resources geographical, 555–556 optimization, genetic algorithms, 433–435 response biased sampling, 146 communication channels, 89 control groups market research versus, 38 marketing campaigns, 106 cumulative response concentration, 82–83 results, assessing, 85 customer relationships, 457 differential response analysis, marketing campaigns, 107–108 erroneous conclusions, 74 free text, 285 good response scores, 34 marketing campaigns, 96–97 prediction, MBR, 258 proof-of-concept projects, 599 response models genetic algorithms, 440–443 prospects, ranking, 36 response times, interactive systems, 33 sample sizes, 145 single response rates, 141 survey response customer classification, 91 inconclusive, 46 profiling, 53 survey-based market research, 113 useful data sources, 61 results actionable, 22 assessing, 85 comparing expectations to, 31 deliverables, data transformation, 57–58 measuring, virtuous cycle, 30–32 neural networks, 241–243 replicating, 33 statistical analysis, 141–143 tainted, 72 retention calculating, 385–386 churn and, 116–120 customer relationships, 467–469 exponential decay, 389–390, 393 hazards, 404–405 median customer lifetime value, 387 retention curves, 386–389 truncated mean lifetime value, 389 retrospective customer value, 115 revenue, behavior-based variables, 581–585 revolvers, behavior-based variables, 580 RFM (recency, frequency, and monetary) value, 575 ring diagrams, as alternative to decision trees, 199–201 risks hazards, 403 proof-of-concept projects, 599 ROC curves, 98–99, 101 root sets, link analysis, 333 RuleQuest Web site, 190 rules association rules actionable rules, 296 affinity grouping, 11 anonymous versus identified transactions, 308 data quality, 308 dissociation rules, 317 effectiveness of, 299–301 inexplicable rules, 297–298
point-of-sale data, 288 practical limits, overcoming, 311–313 prediction, 70 probabilities, calculating, 309 products, hierarchical categories, 305 sequential analysis, 318–319 for store comparisons, 315–316 trivial rules, 297 virtual items, 307 decision trees, 193–194 generalized delta, 229 rule-oriented problems, 176

S
SAC (Simplifying Assumptions Corporation), 97, 100 sample sizes, statistical analysis, 145 sample variation, statistics, 129 SAS Enterprise Miner Tree Viewer tool, 167–168 scalability, data mining, 533–534 scaling, automatic cluster detection, 363–364 scanners, point-of-sale, scarce data, 62 SCF (sectional center facility), 553 schemata, genetic algorithms, 434, 436–438 scores bizocity, 112–113 cutoff, 98 decision trees, 169–170 good response, 34 index-based, 92–95 model deployment, 84–85 propensity-to-respond, 97 proportional, census data, 94–95 score sets, 52 scoring platforms, data mining, 527–528 sorting customers by, z-scores, 551 search programs, link analysis, 331 searchable criteria, relevance feedback, 268 sectional center facility (SCF), 553 selection step, genetic algorithms, 429 self-organizing map (SOM), 249–251, 372 sensitivity analysis, neural networks, 247–248 sequential analysis, association rules, 318–319 sequential events, applying decision trees to, 205 sequential patterns, identifying, 24 server platforms, affordability, 13 service business sectors, customer relationships, 13–14 shared labels, fax machines, 341 short form, census data, 94 short-term trends, 75 sigmoid activation functions, neural networks, 225 signatures, customers assembling, 68 business versus residential customers, 561 columns, pivoting, 563 computational issues, 594–596 considerations, 564 customer identification, 560–562 data for, cataloging, 559–560 discussed, 540–541 model set creation, 68 snapshots, 562 time frames, identifying, 562 similarity and distance, automatic cluster detection, 359–363 similarity matrix, 368 similarity
measurements, MBR, 271–272 Simplifying Assumptions Corporation (SAC), 97, 100 simulated annealing, 230 single linkage, automatic cluster detection, 369 single response rates, 141 single views, customers, 517–518 sites. See Web sites skewed distributions, data correction, 73 SKUs (stock-keeping units), 305 small-business relationships, customer relationship management, SMP (symmetric multiprocessor), 485 snapshots, customer signatures, 562 social information filtering, 282 soft clustering, automatic cluster detection, 367 SOI (sphere of influence), 38 sole proprietors, solicitation, marketing campaigns, 96 SOM (self-organizing map), 249–251, 372 source systems, 484, 486–487, 594 special-purpose code, 595 sphere of influence (SOI), 38 spiders, web crawlers, 331 splits, decision trees on categorical input variables, 174 chi-square testing, 180–183 discussed, 170 diversity measures, 177–178 entropy, 179 finding, 172 Gini splitting criterion, 178 information gain ratio, 178, 180 intrinsic information of, 180 missing values, 174–175 multiway, 171 on numeric input variables, 173 population diversity, 178 purity measures, 177–178 reduction in variance, 183 surrogate, 175 spreadsheets, results, assessing, 85 SQL data, time series analysis, 572–573 stability-based pruning, decision trees, 191–192 staffing, data mining, 525–526 standard deviation estimation, 81 statistics, 132, 138 variance and, 138 standard error of proportion, statistical analysis, 139–141 standardization, numeric values, 551 standardized values, statistics, 129–133 star schema structure, relational databases, 505 statistical analysis business data versus scientific data, 159 censored data, 161 Central Limit Theorem, 129–130 chi-square tests case study, 155–158 degrees of freedom values, chi-square tests, 152–153 difference of proportions versus, 153–154 discussed, 149 expected values, calculating, 150–151 continuous
variables, 137–138 correlation ranges, 139 cross-tabulations, 136 density function, 133 as disciplinary technique, 123 discrete values, 127–131 experimentation, 160–161 field values, 128 histograms and, 127 marketing campaign approaches acuity of testing, 147–148 confidence intervals, 146 proportion, standard error of, 139–141 sample sizes, 145 mean values, 137 median values, 137 mode values, 137 multiple comparisons, 148–149 normal distribution, 130–132 null hypothesis and, 125–126 probabilities, 133–135 p-values, 126 q-values, 126 range values, 137 regression ranges, 139 sample variation, 129 standard deviation, 132, 138 standardized values, 129–133 sum of values, 137–138 time series analysis, 128–129 truncated data, 162 variance, 138 z-values, 131, 138 statistical regression techniques, genetic algorithms, 423 status codes, as categorical value, 239 stemming, link analysis, 333 stock-keeping units (SKUs), 305 store comparisons, association rules for, 315–316 stratification customer relationships and, 469 hazards, 410 strings, fixed-length characters, 552–554 subgroups automatic cluster detection agglomerative clustering, 368–370 case study, 374–378 categorical variables, 359 centroid distance, 369 complete linkage, 369 data preparation, 363–365 dimension, 352 directed clustering, 372 discussed, 12, 91, 351 distance and similarity, 359–363 divisive clustering, 371–372 evaluation, 372–373 Gaussian mixture model, 366–367 geometric distance, 360–361 hard clustering, 367 Hertzsprung-Russell diagram, 352–354 luminosity, 351 scaling, 363–364 single linkage, 369 soft clustering, 367 SOM (self-organizing map), 372 vectors, angles between, 361–362 weighting, 363–365 zone boundaries, adjusting, 380 business goals, formulating, 605 customer attributes, 11 data transformation, 57 overview, 11 profiling tasks, 12 undirected data mining, 57 subscription-based relationships, customer relationships, 459–460 subtrees, decision trees, 189 sum of values,
statistics, 137–138 summarization, data transformation, 44 summation function, 272 supermarket chains, as information brokers, 15–16 supervised learning, 57 support, market based analysis, 301 surrogate splits, decision trees, 175 survey responses customer classification, 91 inconclusive, 46 profiling, 53 survey-based market research, 113 useful data sources, 61 survival analysis attrition, handling different types of, 412–413 customer relationships, 413–415 estimation tasks, 10 forecasting, 415–416 symmetric multiprocessor (SMP), 489–490 T tables, lookup, auxiliary information, 570–571 tainted results, 72 tangent function, 223 target columns, 547 target fields, input variables, 37 target market versus control group response, 38 targeted acquisition campaigns, 31 targeting good prospects, identifying, 88–89 prospecting, 88 taxonomy, products, 305 telecommunications customers, market based analysis, 288 telephone switches, transaction processing systems, terabytes, Teradata, relational database management software, 13 termination of services, 114 testing acuity of, statistical analysis, 147–148 chi-square tests case study, 155–158 CHIDIST function, 152 degrees of freedom values, 152–153 difference of proportions versus, 153–154 discussed, 149 expected values, calculating, 150–151 splits, decision trees, 180–183 F tests, 183–184 hypothesis testing confidence levels, 148 considerations, 51 decision-making process, 50–51 generating, 51 market basket analysis, 51 null hypothesis, statistics and, 125–126 truncated mean lifetime value, retention, 389 truthful learning sources, 48–50 two-tailed distribution, 134 U undirected data mining affinity grouping, 57 clustering, 57 discussed, uniform distribution, statistics, 132 uniform product code (UPC), 555 UNIT_MASTER file, customer signatures, 559 unordered lists, 239 unsupervised learning, 57 untruthful learning sources, 44–48 UPC (uniform product
code), 555 UPS, transaction processing systems, 3–4 up-selling customer relationships, 467 marketing campaigns, 111, 115–116 U.S. Census Bureau Web site, 94 usage stimulation marketing campaigns, 111 user roles, data transformation, 58–60 testing (continued) KS (Kolmogorov-Smirnov) tests, 101 preclassified tests, 79 test groups, marketing campaigns, 106 test sets out of time tests, 72 uses for, 52 time attributes, market based analysis, 293 and dates, interval variables, 551 dependency, prospecting and, 160 frames, customer signatures, 562 series analysis neural networks, 244–247 non-time series data, 246 SQL data, 572–573 statistics, 128–129 training sets coverage of values, 232 MBR (memory-based reasoning), 263–264 model sets, partitioning, 71 optimization as, 230 uses for, 52 transaction data, OLAP, 476–477 transaction processing systems, customer relationship management, 3–4 transactional records, 574 transactors, behavior-based variables, 580 transfer function, neural networks, 223 TRANS_MASTER file, customer signatures, 559 traveling salesman problem, graph theory, 327–329 trends, capturing, 75 triangle inequality, distance function, 272 trivial rules, association rules, 297 truncated data, statistics, 162 V validation assumptions, 67 neural networks, 218 validation sets model sets, partitioning, 71 test sets, partitioning, 71 uses for, 52 value added-services, predication tasks, 10 valued outcomes, estimation, values comparing with descriptions, 65 with meaning, data correction, 74 missing, 590–591 variables data selection, 63–64 variable selection problems, neural networks, 233 variance analysis of, 124 defined, 81 neural networks, 199 reduction in, splits, decision trees, 183 standard deviation and, 138 statistics, 138 variations, percent, 105 vectors, angles between, 361–362 vendor credibility, 537 virtual items, association rules, 307 virtuous cycle action tasks, 30 business opportunities, identifying,
27–28 data transformation, 28–30 discussed, 28 results, measuring, 30–32 stages of, 26 visualization tools, data exploration, 65 voice node, fax machines, 341 voice recognition, free text resources, 556 voluntary churn, 118–119, 521 voting positive ratings, 284 weighted, 281–282 W warehouses, searching data in, 61–62 warranty claims data, useful data sources, 60 web crawlers, spiders, 331 Web pages classification, useful data sources, 60 Web servers cookies, 109 transaction processing systems, Web sites customer response to marketing campaigns, tracking, 109 RuleQuest, 190 U.S. Census Bureau, 94 weight columns, 548 weighted graphs, graph theory, 322, 324 weighted voting, 281–282 weighting, automatic cluster detection, 363–365 welcome periods, loyalty programs, 518 well-defined distance, distance function, 271 winback approach, customer relationships, 470 wireless communications industries business opportunities, identifying, 34–35 MOU (minutes of use), 38 rate plans, finding appropriate, women, differential response analysis and, 107 word-of-mouth advertising, 283 Z zip codes as categorical value, 239 distance function, 276–277 zone boundaries, adjusting, using automatic cluster detection, 380 z-scores, 551 z-values, statistics, 131, 138

... Summary Data Decision-Support Summary Data Database Schema Metadata Business Rules A General Architecture for Data Warehousing Source Systems Extraction, Transformation, and Load Central Repository... Variables Statistical Measures for Continuous Variables Variance and Standard Deviation A Couple More Statistical Ideas Measuring Response Standard Error of a Proportion Comparing Results Using... Finding Fax Machines Is Useful The Data as a Graph The Approach Some Results Case Study: Segmenting Cellular Telephone Customers The Data Analyses without Graph Theory A Comparison of Two Customers

Format
Pages	673
Size	13.9 MB