470643 bindex.qxd 3/8/04 11:08 AM Page 618 618 Index auxiliary information, 569–571 availability of data, determining, 515–516 average member technique, neural networks, 252 averages, estimation, 81 B back propagation, feed-forward neural networks, 228–232 backfitting, defined, 170 bad customers, customer relationship management, 18 bad data formats, data transformation, 28 balance transfer programs, industry revolution, 18 balanced datasets, model sets, 68 balanced sampling, 68 bathtub hazards, 397–398 behaviors behavioral segments, marketing campaigns, 111–113 behavior-based variables ad hoc questions, 585 aggression, 18 convenience users, 580, 587–589 declining usage, 577–579 estimated revenue, segmenting, 581–583 ideals, comparisons to, 585–587 potential revenue, 583–585 purchasing frequency, 575–576 revolvers, 580 transactions, 580 future customer behaviors, predicting, 10 bell-shaped distribution, 132 benefit, point of maximum, 101 Bernoulli, Jacques (binomial formula), 191 biased sampling confidence intervals, statistical analysis, 146 neural networks, 227 response, methods of, 146 untruthful learning sources, 46–47 BILL_MASTER file, customer signatures, 559 binary churn models, 119 binary classification decision trees, 168 misclassification rates, 98 binary data, 557 binning, 237, 551 binomial formula (Jacques Bernoulli), 191 biological neural networks, 211 births, house-hold level data, 96 bizocity scores, 112–113 Bonferroni, Carlo (Bonferroni’s correction), 149 box diagrams, as alternative to decision trees, 199–201 brainstorming meetings, 37 branching nodes, decision trees, 176 budgets, fixed, marketing campaigns, 97–100 building models, data mining, 8, 77 Building the Data Warehouse (Bill Inmon), 474 Business Modeling and Data Mining (Dorian Pyle), 60 businesses challenges of, identifying, 23–24 customer relationship management, 2–6 customer-centric, 514–515 forward-looking, 2 home-based, 56 large-business relationships, 3–4 opportunities, identifying 
virtuous cycle, 27–28 wireless communication industries, 34–35 product-focused, 2 recommendation-based, 16–17 small-business relationships, 2 C calculations, probabilities, 133–135 call detail databases, 37 call-center records, useful data sources, 60 campaigns, marketing. See also advertising acquisitions-time data, 108–110 canonical measurements, 31 champion-challenger approach, 139 credit risks, reducing exposure to, 113–114 cross-selling, 115–116 customer response, tracking, 109 customer segmentation, 111–113 differential response analysis, 107–108 discussed, 95 fixed budgets, 97–100 loyalty programs, 111 new customer information, gathering, 109–110 people most influenced by, 106–107 planning, 27 profitability, 100–104 proof-of-concept projects, 600 response modeling, 96–97 as statistical analysis acuity of testing, 147–148 confidence intervals, 146 proportion, standard error of, 139–141 results, comparing, using confidence bounds, 141–143 sample sizes, 145 targeted acquisition campaigns, 31 types of, 111 up-selling, 115–116 usage stimulation, 111 candidates, link analysis, 333 canonical measurements, marketing campaigns, 31 capture trends, data transformation, 75 car ownership, house-hold level data, 96 CART (Classification and Regression Trees) algorithm, decision trees, 185, 188–189 case studies automatic cluster detection, 374–378 chi-square tests, 155–158 decision trees, 206, 208 genetic algorithms, 440–443 link analysis, 343–346 MBR (memory-based reasoning), 259–262 neural networks, 252–254 catalogs response models, decision trees for, 175 retailers, historical customer behavior data, 5 categorical variables automatic cluster detection, 359 data correction, 73 marriages, 239–240 measures of, 549 neural networks, 239–240 propensity, 242 splits, decision trees, 174 censored data hazards, 399–403 statistics, 161 census data proportional scoring, 94–95 useful data sources, 61 Central Limit Theorem,
statistics, 129–130 central repository, 484, 488, 490 centroid distance, automatic cluster detection, 369 C5 pruning algorithm, decision trees, 190–191 CHAID (Chi-square Automatic Interaction Detector), 182–183 challenges, business challenges, identifying, 23–24 champion-challenger approach, marketing campaigns, 139 change processes, feedback, 34 charts concentration, 101 cumulative gains, 101 lift charts, 82, 84 time series, 128–129 CHIDIST function, 152 child nodes, classification, 167 children, number of, house-hold level data, 96 chi-square tests case study, 155–158 CHAID (Chi-square Automatic Interaction Detector), 182–183 CHIDIST function, 152 degrees of freedom values, 152–153 difference of proportions versus, 153–154 discussed, 149 expected values, calculating, 150–151 splits, decision trees, 180–183 churn as binary outcome, 119 customer longevity, predicting, 119–120 EBCF (existing base churn forecast), 469 expected, 118 forced attrition, 118 importance of, 117–118 involuntary, 118–119, 521 recognizing, 116–117 retention and, 116–120 voluntary, 118–119, 521 class labels, probability, 85 classification accuracy, 79 binary decision trees, 168 misclassification rates, 98 business goals, formulating, 605 child nodes, 167 correct classification matrix, 79 data transformation, 57 decision trees, 166–168 directed data mining, 57 discrete outcomes, 9 estimation, 9 leaf nodes, 167 memory-based reasoning, 90–91 overview, 8–9 performance, 12 Classification and Regression Trees (CART) algorithm, decision trees, 185, 188–189 classification codes discussed, 266 precision measurements, 273–274 recall measurements, 273–274 clustering automatic cluster detection agglomerative clustering, 368–370 case study, 374–378 categorical variables, 359 centroid distance, 369 complete linkage, 369 data preparation, 363–365 dimension, 352 directed clustering, 372 discussed, 12, 91, 351 distance and similarity, 359–363 divisive
clustering, 371–372 evaluation, 372–373 Gaussian mixture model, 366–367 geometric distance, 360–361 hard clustering, 367 Hertzsprung-Russell diagram, 352–354 luminosity, 351 scaling, 363–364 single linkage, 369 soft clustering, 367 SOM (self-organizing map), 372 vectors, angles between, 361–362 weighting, 363–365 zone boundaries, adjusting, 380 business goals, formulating, 605 customer attributes, 11 data transformation, 57 overview, 11 profiling tasks, 12 undirected data mining, 57 coding, special-purpose code, 595 collaborative filtering estimated ratings, 284–285 grouping customers, 90 predictions, 284–285 profiles, building and comparing, 283–284 social information filtering, 282 word-of-mouth advertising, 283 collections, credit risks, 114 columns, data cost, 548 derived variables, 542 discussed, 542 identification, 548 ignored, 547 input, 547 with one value, 544–546 target, 547 with unique values, 546–547 weight, 548 combination function attrition history, 280 MBR (memory-based reasoning), 258, 265 neural networks, 222 weighted voting, 281–282 commercial software products, 15 communication channels, prospecting, 89 companies.
See businesses comparisons comparing models, using lift ratio, 81–82 data, 83 statistical analysis, 148–149 competing risks, hazards, 403 competitive advantage, information as, 14 complete linkage, automatic cluster detection, 369 computational issues, customer signatures, 594–596 concentration concentration charts, 101 cumulative response, 82–83 confidence intervals hypothesis testing, 148 statistical analysis, 146, 148–149 confusion aggregation and, 48 confusion matrix, 79 data transformation, 28 conjugate gradient, 230 constant hazards changing over time hazards versus, 416–417 discussed, 397 continuous variables data preparation, 235–237 neural networks, 235–237 statistics, 137–138 control group response marketing campaigns, 106 target market response versus, 38 controlled experiments, hypothesis testing, 51 convenience users, behavior-based variables, 580, 587–589 cookies, Web servers, 109 correct classification matrix, 79 correlation ranges, statistics, 139 costs cost columns, 548 decision tree considerations, 195 countervailing errors, 81 counts, converting to proportions, 75–76 coverage of values, neural networks, 232–233 Cox proportional hazards, 410–411 creative process, data mining as, 33 credit credit applications classification tasks, 9 prediction tasks, 10 useful data sources, 60 credit risks, reducing exposure to, 113–114 crossover, genetic algorithms, 430 cross-selling opportunities affinity grouping, 11 customer relationships, 467 marketing campaigns, 111, 115–116 reasons for, 17 cross-tabulations, 136, 567–568 cumulative gains, 36, 101 cumulative response concentration, 82–83 results, assessing, 85 customers attributes, clustering, 11 behaviors of, gaining insight, 56 customer relationships bad customers, weeding out, 18 building businesses around, 2 customer acquisition, 461–464 customer activation, 464–466 customer-centric enterprises, 3 data mining role in, 5–6 data warehousing, 4–5 deep
intimacy, 449, 451 event-based relationships, 458–459 good customers, holding on to, 17–18 in-between relationships, 453 indirect relationships, 453–454 interests in, 13–14 large-business relationships, 3–4 levels of, 448 life stages, 455–456 lifetime customer value, 32 mass intimacy, 451–453 retention, 467–469 service business sectors, 13–14 small-business relationships, 2 stages, 457 strategies for, 6 stratification, 469 subscription-based relationships, 459–460 survival analysis, 413–415 transaction processing systems, 3–4 up-selling, 467 winback approach, 470 customer-centric businesses, 514–515, 516–521 demographic profiles, 31 grouping, collaborative filtering and, 90 interactions, learning opportunities, 520–521 loyalty, 520 marginal, 553 new customer information gathering, 109–110 memory-based reasoning, 277 profiles, building, 283 prospective customer value, 115 responses to marketing campaigns, 109 prediction, MBR, 258 retrospective customer value, 115 segmentation, marketing campaigns, 111–113 sequential patterns, identifying, 24 signatures assembling, 68 business versus residential customers, 561 columns, pivoting, 563 computational issues, 594–596 considerations, 564 customer identification, 560–562 data for, cataloging, 559–560 discussed, 540–541 model set creation, 68 snapshots, 562 time frames, identifying, 562 single views, 517–518 sorting, by scores, 8 telecommunications, market basket analysis, 288 cutoff scores, 98 cyclic graphs, 330–331 D data acquisition-time, 108–110 as actionable information, 516 availability, determining, 515–516 binary, 557 business versus scientific, statistical analysis, 159 censored, 161 by census tract, 94 central repository, 484, 488, 490 columns cost, 548 derived variables, 542 discussed, 542 identification, 548 ignored, 547 input, 547 with one value, 544–546 target, 547 with unique values, 546–547 weight, 548 comparisons, 83 for customer
signatures, cataloging, 559–560 data correction categorical variables, 73 encoding, inconsistent, 74 missing values, 73–74 numeric variables, 73 outliers, 73 overview, 72 skewed distributions, 73 values with meaning, 74 data exploration assumptions, validating, 67 descriptions, comparing values with, 65 discussed, 64 distributions, examining, 65 histograms, 565–566 intuition, 65 question asking, 67–68 data marts, 485, 491–492 data selection contents of, outcomes of interest, 64 data locations, 61–62 density, 62–63 history of, determining, 63 scarce data, 61–62 variable combinations, 63–64 data transformation capture trends, 75 counts, converting to proportions, 75–76 discussed, 74 information technology and user roles, 58–60 problems, identifying, 56–57 ratios, 75 results, deliverables, 58 results, how to use, 57–58 summarization, 44 virtuous cycle, 28–30 dirty, 592–593 dumping, flat files, 594 enterprise-wide, 33 ETL (extraction, transformation, and load) tools, 487 gigabytes, 5 as graphs, 337 historical customer behaviors, 5 MBR (memory-based reasoning), 262–263 neural networks, 219 prediction tasks, 10 house-hold level, 96 imperfections in, 34 inconsistent, 593–594 as information, 22 metadata repository, 484, 491 missing data data correction, 73–74 NULL values, 590 splits, decision trees, 174–175 operational feedback, 485, 492 patterns meaningful discoveries, 56 prediction, 45 untruthful learning sources, 45–46 point-of-sale association rules, 288 scanners, 3 as useful data source, 60 preparation automatic cluster detection, 363–365 categorical values, neural networks, 239–240 continuous values, neural networks, 235–237 quality, association rules, 308 representation, genetic algorithms, 432–433 scarce, 62 source systems, 484, 486–487 SQL, time series analysis, 572–573 terabytes, 5 truncated, 162 useful data sources, 60–61 visualization tools, 65 wrong level of detail, untruthful learning
sources, 47 data mining architecture, 528–532 as creative process, 33 directed classification, 57 discussed, 7 estimation, 57 prediction, 57 documentation, 536–537 goals of, 7 insourcing, 524–525 outsourcing, 522–524 platforms, 527 scalability, 533–534 scoring platforms, 527–528 staffing, 525–526 typical operational systems versus, 33 undirected affinity grouping, 57 clustering, 57 discussed, 7 Data Preparation for Data Mining (Dorian Pyle), 75 The Data Warehouse Toolkit (Ralph Kimball), 474 data warehousing customer patterns, 5 for decision support, 13 discussed, 4 database administrators (DBAs), 488 databases call detail, 37 demographic, 37 KDD (knowledge discovery in databases), 8 server platforms, affordability, 13 datasets, balanced, model sets, 68 dates and times, interval variables, 551 DBAs (database administrators), 488 deaths, house-hold level data, 96 debt, nonrepayment of, credit risks, 114 decision support data warehousing for, 13 hypothesis testing, 50–51 summary data, OLAP, 477–478 decision trees alphas, 188 alternate representations for, 199–202 applying to sequential events, 205 branching nodes, 176 building models, 8 case-study, 206, 208 470643 bindex.qxd 3/8/04 11:08 AM Page 625 Index 625 for catalog response models, 175 classification, 9, 166–168 cost considerations, 195 effectiveness of, measuring, 176 estimation, 170 as exploration tool, 203–204 fields, multiple, 195–197 neural networks, 199 profiling tasks, 12 projective visualization, 207–208 pruning C5 algorithm, 190–191 CART algorithm, 185, 188–189 discussed, 184 minimum support pruning, 312 stability-based, 191–192 rectangular regions, 197 regression trees, 170 rules, extracting, 193–194 SAS Enterprise Miner Tree Viewer tool, 167–168 scoring, 169–170 splits on categorical input variables, 174 chi-square testing, 180–183 discussed, 170 diversity measures, 177–178 entropy, 179 finding, 172 Gini splitting criterion, 178 information gain ratio, 178, 180 intrinsic information of, 180 missing 
values, 174–175 multiway, 171 on numeric input variables, 173 population diversity, 178 purity measures, 177–178 reduction in variance, 183 surrogate, 175 subtrees, selecting, 189 uses for, 166 declining usage, behavior-based variables, 577–579 deep intimacy, customer relationships, 449, 451 default classes, records, 194 default risks, proof-of-concept projects, 599 degrees of freedom values, chi-square tests, 152–153 democracy approach, memory-based reasoning, 279–281 demographic databases, 37 demographic profiles, customers, 31 density data selection, 62–63 density function, statistics, 133 deploying models, 84–85 derived variables, column data, 542 descriptions comparing values with, 65 data transformation, 57 descriptive models, assessing, 78 descriptive profiling, 52 deviation. See standard deviation difference of proportion chi-square tests versus, 153–154 statistical analysis, 143–144 differential response analysis, marketing campaigns, 107–108 differentiation, market basket analysis, 289 dimension automatic cluster detection, 352 dimension tables, OLAP, 502–503 directed clustering, automatic cluster detection, 372 directed data mining classification, 57 discussed, 7 estimation, 57 prediction, 57 directed graphs, 330 directed models, assessing, 78–79 directed profiling, 52 dirty data, 592–593 discrete outcomes, classification, 9 discrete values, statistics, 127–131 discrimination measures, ROC curves, 99 dissociation rules, 317 distance and similarity, automatic cluster detection, 359–363 distance function defined, 271–272 discussed, 258, 265 hidden distance fields, 278 identity distance, 271 numeric fields, 275 triangle inequality, 272 zip codes, 276–277 distribution data exploration, 65 one-tailed, 134 probability and, 135 statistics, 130–132 two-tailed, 134 diverse data types, 536 diversity measures, splitting criteria, decision trees, 177–178 divisive clustering, automatic cluster detection, 371–372
documentation data mining, 536–537 historical data as, 61 dumping data, flat files, 594 E EBCF (existing base churn forecast), 469 economic data, useful data sources, 61 edges, graphs, 322 education level, house-hold level data, 96 e-mail as communication channel, 89 free text resources, 556–557 encoding, inconsistent, data correction, 74 enterprise-wide data, 33 entropy, information gain, 178–180 equal-height binning, 551 equal-width binning, 551 erroneous conclusions, 74 errors countervailing, 81–82 error rates adjusted, 185 establishing, 79 measurement, 159 operational, 159 predicting, 191 standard error of proportion, statistical analysis, 139–141 established customers, customer relationships, 457 estimation accuracy, 79–81 averages, 81 business goals, formulating, 605 classification tasks, 9 collaborative filtering, 284–285 data transformation, 57 decision trees, 170 directed data mining, 57 estimation task examples, 10 examples of, 10 neural networks, 10, 215 regression models, 10 revenue, behavior-based variables, 581–583 standard deviation, 81 valued outcomes, 9 ETL (extraction, transformation, and load) tools, 487, 595 evaluation, automatic cluster detection, 372–373 event-based relationships, customer relationships, 458–459 existing base churn forecast (EBCF), 469 expectations comparing to results, 31 expected values, chi-square tests, 150–151 proof-of-concept projects, 599 expected churn, 118 experimentation hypothesis testing, 51 statistics, 160–161 exploration tools, decision trees as, 203–204 exponential decay, retention, 389–390, 393 expressive power, descriptive models, 78 extraction, transformation, and load (ETL) tools, 487, 595 F F tests (Ronald A.
Fisher), 183–184 fax machines, link analysis, 337–341 Federal Express, transaction processing systems, 3–4 feedback change processes, 34 operational, 485, 492 relevance feedback, MBR, 267–268 feed-forward neural networks back propagation, 228–232 hidden layer, 227 input layer, 226 output layer, 227 field values, statistics, 128 Fisher, Ronald A. (F tests), 183–184 fixed budgets, marketing campaigns, 97–100 fixed positions, generic algorithms, 435 fixed-length character strings, 552–554 flat files, dumping data, 594 forced attrition, 118 forecasting EBCF (existing base churn forecast), 469 NSF (new start forecast), 469 survival analysis, 415–416 former customers, customer relationships, 457 forward-looking businesses, 2 fraud detection, MBR, 258 fraudulent insurance claims, classification, 9 free text response, memory-based reasoning, 285 functionality, lack of, data transformation, 28 functions activation, 222 CHIDIST, 152 combination attrition history, 280 MBR (memory-based reasoning), 258, 265 neural networks, 272 weighted voting, 281–282 density, 133 distance defined, 271–272 discussed, 258, 265 hidden distance fields, 278 identity distance, 271 numeric fields, 275 triangle inequality, 272 zip codes, 276–277 hyperbolic tangent, 223 NORMDIST, 134 NORMSINV, 147 sigmoid, 225 summation, 272 tangent, 223 transfer, 223 future attrition, 49 future customer behaviors, predicting, 10 G gains, cumulative, 36, 101 Gaussian mixture model, automatic cluster detection, 366–367 gender as categorical value, 239 profiling example, 12 generalized delta rules, 229 [...]... feedback, 267–268 similarity measurements, 271–272 training data, 263–264 weighted voting, 281–282 men, differential response analysis and, 107 messages, prospecting, 89–90 metadata repository, 484, 491 methodologies data correction, 72–74 data exploration, 64–68 data mining process, 54–55 data selection, 60–64 data transformation, 74–76 data translation, 56–60 learning sources truthful, 48–50 untruthful,... 
Dorian Business Modeling and Data Mining, 60 Data Preparation for Data Mining, 75 Q quadratic discriminates, box diagrams, 200 quality of data, association rules, 308 question asking, data exploration, 67–68 Quinlan, J. Ross (Iterative Dichotomiser 3), 190 q-values, statistics, 126 R range values, statistics, 137 rate plans, wireless telephone services, 7 ratios data transformation, 75 lift ratio,... 185, 188–189 discussed, 184 minimum support pruning, 312 stability-based, 191–192 public records, house-hold level data, 96 publications Building the Data Warehouse (Bill Inmon), 474 Business Modeling and Data Mining (Dorian Pyle), 60 Data Preparation for Data Mining (Dorian Pyle), 75 The Data Warehouse Toolkit (Ralph Kimball), 474 Genetic Algorithms in Search, Optimization, and Machine Learning (Goldberg),... association rules, 297–298 information competitive advantages, 14 data as, 22 infomediaries, 14 information brokers, supermarket chains as, 15–16 information gain, entropy, 178–180 information technology, data transformation, 58–60 as products, 14 recommendation-based businesses, 16–17 Inmon, Bill (Building the Data Warehouse), 474 input columns, 547 input layer, feed-forward neural networks, 226 input... association rules, 70 business goals, formulating, 605 collaborative filtering, 284–285 credit risks, 113–114 customer longevity, 119–120 data transformation, 57 defined, 52 directed data mining, 57 errors, 191 future behaviors, 10 historical data, 10 model sets for, 70–71 neural networks, 215 patterns, 45 prediction task examples, 10 profiling versus, 52–53 response, MBR, 258 uses for, 54 probabilities calculating,...
adjusting, 380 business goals, formulating, 605 customer attributes, 11 data transformation, 57 overview, 11 profiling tasks, 12 undirected data mining, 57 subscription-based relationships, customer relationships, 459–460 subtrees, decision trees, 189 sum of values, statistics, 137–138 summarization, data transformation, 44 summation function, 272 supermarket chains, as information brokers, 15–16 supervised... start forecast), 469 null hypothesis, statistics and, 125–126 NULL values, missing data, 590 numeric variables data correction, 73 distance function, 275 measure of, 550–551 splits, decision trees, 173 O Occam’s Razor, 124–125 ODBC (Open Database Connectivity), 496 one-tailed distribution, 134 Online Analytic Processing (OLAP) additive facts, 501 data mining and, 507–508 decision-support summary data, ... networks, 222 Oracle, relational database management software, 13 order characteristics, market basket analysis, 292 ordered lists, 239 ordered variables, measure of, 549 organizations. See businesses out of time tests, 72 outliers data correction, 73 data transformation, 74 output layer, feed-forward neural networks, 227 outputs, neural networks, 215 outsourcing data mining, 522–524 overfitting, neural... unordered, 239 literature, market research, 22 logarithms, data transformation, 74 logical schema, OLAP, 478 logistic methods, box diagrams, 200 long form, census data, 94 long-term trends, 75 lookup tables, auxiliary information, 570–571 loyalty customers, 520 loyalty programs marketing campaigns, 111 welcome periods, 518 luminosity, 351 M mailings marketing campaigns, 97 non-response models, 35 marginal...
proportion, 203 percent variations, 105 perceptrons, defined, 212 performance, classification, 12 physical schema, OLAP, 478 pilot projects, 598 planar graphs, 323 planned processes, proof-of-concept projects, 599 platforms, data mining, 527 point of maximum benefit, 101 point-of-sale data association rules, 288 scanners, 3 as useful data source, 60 population diversity, 178 positive ratings, voting,