470643 bindex.qxd 3/8/04 11:08 AM Page 618 618 Index

auxiliary information, 569–571
availability of data, determining, 515–516
average member technique, neural networks, 252
averages, estimation, 81

B
back propagation, feed-forward neural networks, 228–232
backfitting, defined, 170
bad customers, customer relationship management, 18
bad data formats, data transformation, 28
balance transfer programs, industry revolution, 18
balanced datasets, model sets, 68
balanced sampling, 68
bathtub hazards, 397–398
behaviors
    behavioral segments, marketing campaigns, 111–113
    behavior-based variables
        ad hoc questions, 585
        aggression, 18
        convenience users, 580, 587–589
        declining usage, 577–579
        estimated revenue, segmenting, 581–583
        ideals, comparisons to, 585–587
        potential revenue, 583–585
        purchasing frequency, 575–576
        revolvers, 580
        transactions, 580
    future customer behaviors, predicting, 10
bell-shaped distribution, 132
benefit, point of maximum, 101
Bernoulli, Jacques (binomial formula), 191
biased sampling
    confidence intervals, statistical analysis, 146
    neural networks, 227
    response, methods of, 146
    untruthful learning sources, 46–47
BILL_MASTER file, customer signatures, 559
binary churn models, 119
binary classification
    decision trees, 168
    misclassification rates, 98
binary data, 557
binning, 237, 551
binomial formula (Jacques Bernoulli), 191
biological neural networks, 211
births, household-level data, 96
bizocity scores, 112–113
Bonferroni, Carlo (Bonferroni’s correction), 149
box diagrams, as alternative to decision trees, 199–201
brainstorming meetings, 37
branching nodes, decision trees, 176
budgets, fixed, marketing campaigns, 97–100
building models, data mining, 8, 77
Building the Data Warehouse (Bill Inmon), 474
Business Modeling and Data Mining (Dorian Pyle), 60
businesses
    challenges of, identifying, 23–24
    customer relationship management, 2–6
    customer-centric, 514–515
    forward-looking, 2
    home-based, 56
    large-business relationships, 3–4
    opportunities, identifying
        virtuous cycle, 27–28
        wireless communication industries, 34–35
    product-focused, 2
    recommendation-based, 16–17
    small-business relationships, 2

C
calculations, probabilities, 133–135
call detail databases, 37
call-center records, useful data sources, 60
campaigns, marketing. See also advertising
    acquisition-time data, 108–110
    canonical measurements, 31
    champion-challenger approach, 139
    credit risks, reducing exposure to, 113–114
    cross-selling, 115–116
    customer response, tracking, 109
    customer segmentation, 111–113
    differential response analysis, 107–108
    discussed, 95
    fixed budgets, 97–100
    loyalty programs, 111
    new customer information, gathering, 109–110
    people most influenced by, 106–107
    planning, 27
    profitability, 100–104
    proof-of-concept projects, 600
    response modeling, 96–97
    as statistical analysis
        acuity of testing, 147–148
        confidence intervals, 146
        proportion, standard error of, 139–141
        results, comparing, using confidence bounds, 141–143
        sample sizes, 145
    targeted acquisition campaigns, 31
    types of, 111
    up-selling, 115–116
    usage stimulation, 111
candidates, link analysis, 333
canonical measurements, marketing campaigns, 31
capture trends, data transformation, 75
car ownership, household-level data, 96
CART (Classification and Regression Trees) algorithm, decision trees, 185, 188–189
case studies
    automatic cluster detection, 374–378
    chi-square tests, 155–158
    decision trees, 206, 208
    genetic algorithms, 440–443
    link analysis, 343–346
    MBR (memory-based reasoning), 259–262
    neural networks, 252–254
catalogs
    response models, decision trees for, 175
    retailers, historical customer behavior data, 5
categorical variables
    automatic cluster detection, 359
    data correction, 73
    marriages, 239–240
    measures of, 549
    neural networks, 239–240
    propensity, 242
    splits, decision trees, 174
censored data
    hazards, 399–403
    statistics, 161
census data
    proportional scoring, 94–95
    useful data sources, 61
Central Limit Theorem, statistics, 129–130
central repository, 484, 488, 490
centroid distance, automatic cluster detection, 369
C5 pruning algorithm, decision trees, 190–191
CHAID (Chi-square Automatic Interaction Detector), 182–183
challenges, business challenges, identifying, 23–24
champion-challenger approach, marketing campaigns, 139
change processes, feedback, 34
charts
    concentration, 101
    cumulative gains, 101
    lift charts, 82, 84
    time series, 128–129
CHIDIST function, 152
child nodes, classification, 167
children, number of, household-level data, 96
chi-square tests
    case study, 155–158
    CHAID (Chi-square Automatic Interaction Detector), 182–183
    CHIDIST function, 152
    degrees of freedom values, 152–153
    difference of proportions versus, 153–154
    discussed, 149
    expected values, calculating, 150–151
    splits, decision trees, 180–183
churn
    as binary outcome, 119
    customer longevity, predicting, 119–120
    EBCF (existing base churn forecast), 469
    expected, 118
    forced attrition, 118
    importance of, 117–118
    involuntary, 118–119, 521
    recognizing, 116–117
    retention and, 116–120
    voluntary, 118–119, 521
class labels, probability, 85
classification
    accuracy, 79
    binary
        decision trees, 168
        misclassification rates, 98
    business goals, formulating, 605
    child nodes, 167
    correct classification matrix, 79
    data transformation, 57
    decision trees, 166–168
    directed data mining, 57
    discrete outcomes, 9
    estimation, 9
    leaf nodes, 167
    memory-based reasoning, 90–91
    overview, 8–9
    performance, 12
Classification and Regression Trees (CART) algorithm, decision trees, 185, 188–189
classification codes
    discussed, 266
    precision measurements, 273–274
    recall measurements, 273–274
clustering
    automatic cluster detection
        agglomerative clustering, 368–370
        case study, 374–378
        categorical variables, 359
        centroid distance, 369
        complete linkage, 369
        data preparation, 363–365
        dimension, 352
        directed clustering, 372
        discussed, 12, 91, 351
        distance and similarity, 359–363
        divisive clustering, 371–372
        evaluation, 372–373
        Gaussian mixture model, 366–367
        geometric distance, 360–361
        hard clustering, 367
        Hertzsprung-Russell diagram, 352–354
        luminosity, 351
        scaling, 363–364
        single linkage, 369
        soft clustering, 367
        SOM (self-organizing map), 372
        vectors, angles between, 361–362
        weighting, 363–365
        zone boundaries, adjusting, 380
    business goals, formulating, 605
    customer attributes, 11
    data transformation, 57
    overview, 11
    profiling tasks, 12
    undirected data mining, 57
coding, special-purpose code, 595
collaborative filtering
    estimated ratings, 284–285
    grouping customers, 90
    predictions, 284–285
    profiles, building and comparing, 283–284
    social information filtering, 282
    word-of-mouth advertising, 283
collections, credit risks, 114
columns, data
    cost, 548
    derived variables, 542
    discussed, 542
    identification, 548
    ignored, 547
    input, 547
    with one value, 544–546
    target, 547
    with unique values, 546–547
    weight, 548
combination function
    attrition history, 280
    MBR (memory-based reasoning), 258, 265
    neural networks, 222
    weighted voting, 281–282
commercial software products, 15
communication channels, prospecting, 89
companies. See businesses
comparisons
    comparing models, using lift ratio, 81–82
    data, 83
    statistical analysis, 148–149
competing risks, hazards, 403
competitive advantage, information as, 14
complete linkage, automatic cluster detection, 369
computational issues, customer signatures, 594–596
concentration
    concentration charts, 101
    cumulative response, 82–83
confidence intervals
    hypothesis testing, 148
    statistical analysis, 146, 148–149
confusion
    aggregation and, 48
    confusion matrix, 79
    data transformation, 28
conjugate gradient, 230
constant hazards
    changing over time hazards versus, 416–417
    discussed, 397
continuous variables
    data preparation, 235–237
    neural networks, 235–237
    statistics, 137–138
control group response
    marketing campaigns, 106
    target market response versus, 38
controlled experiments, hypothesis testing, 51
convenience users, behavior-based variables, 580, 587–589
cookies, Web servers, 109
correct classification matrix, 79
correlation ranges, statistics, 139
costs
    cost columns, 548
    decision tree considerations, 195
countervailing errors, 81
counts, converting to proportions, 75–76
coverage of values, neural networks, 232–233
Cox proportional hazards, 410–411
creative process, data mining as, 33
credit
    credit applications
        classification tasks, 9
        prediction tasks, 10
        useful data sources, 60
    credit risks, reducing exposure to, 113–114
crossover, genetic algorithms, 430
cross-selling opportunities
    affinity grouping, 11
    customer relationships, 467
    marketing campaigns, 111, 115–116
    reasons for, 17
cross-tabulations, 136, 567–568
cumulative gains, 36, 101
cumulative response
    concentration, 82–83
    results, assessing, 85
customers
    attributes, clustering, 11
    behaviors of, gaining insight, 56
    customer relationships
        bad customers, weeding out, 18
        building businesses around, 2
        customer acquisition, 461–464
        customer activation, 464–466
        customer-centric enterprises, 3
        data mining role in, 5–6
        data warehousing, 4–5
        deep intimacy, 449, 451
        event-based relationships, 458–459
        good customers, holding on to, 17–18
        in-between relationships, 453
        indirect relationships, 453–454
        interests in, 13–14
        large-business relationships, 3–4
        levels of, 448
        life stages, 455–456
        lifetime customer value, 32
        mass intimacy, 451–453
        retention, 467–469
        service business sectors, 13–14
        small-business relationships, 2
        stages, 457
        strategies for, 6
        stratification, 469
        subscription-based relationships, 459–460
        survival analysis, 413–415
        transaction processing systems, 3–4
        up-selling, 467
        winback approach, 470
    customer-centric businesses, 514–515, 516–521
    demographic profiles, 31
    grouping, collaborative filtering and, 90
    interactions, learning opportunities, 520–521
    loyalty, 520
    marginal, 553
    new customer information
        gathering, 109–110
        memory-based reasoning, 277
    profiles, building, 283
    prospective customer value, 115
    responses
        to marketing campaigns, 109
        prediction, MBR, 258
    retrospective customer value, 115
    segmentation, marketing campaigns, 111–113
    sequential patterns, identifying, 24
    signatures
        assembling, 68
        business versus residential customers, 561
        columns, pivoting, 563
        computational issues, 594–596
        considerations, 564
        customer identification, 560–562
        data for, cataloging, 559–560
        discussed, 540–541
        model set creation, 68
        snapshots, 562
        time frames, identifying, 562
    single views, 517–518
    sorting, by scores, 8
    telecommunications, market basket analysis, 288
cutoff scores, 98
cyclic graphs, 330–331

D
data
    acquisition-time, 108–110
    as actionable information, 516
    availability, determining, 515–516
    binary, 557
    business versus scientific, statistical analysis, 159
    censored, 161
    by census tract, 94
    central repository, 484, 488, 490
    columns
        cost, 548
        derived variables, 542
        discussed, 542
        identification, 548
        ignored, 547
        input, 547
        with one value, 544–546
        target, 547
        with unique values, 546–547
        weight, 548
    comparisons, 83
    for customer signatures, cataloging, 559–560
    data correction
        categorical variables, 73
        encoding, inconsistent, 74
        missing values, 73–74
        numeric variables, 73
        outliers, 73
        overview, 72
        skewed distributions, 73
        values with meaning, 74
    data exploration
        assumptions, validating, 67
        descriptions, comparing values with, 65
        discussed, 64
        distributions, examining, 65
        histograms, 565–566
        intuition, 65
        question asking, 67–68
    data marts, 485, 491–492
    data selection
        contents of, outcomes of interest, 64
        data locations, 61–62
        density, 62–63
        history of, determining, 63
        scarce data, 61–62
        variable combinations, 63–64
    data transformation
        capture trends, 75
        counts, converting to proportions, 75–76
        discussed, 74
        information technology and user roles, 58–60
        problems, identifying, 56–57
        ratios, 75
        results, deliverables, 58
        results, how to use, 57–58
        summarization, 44
        virtuous cycle, 28–30
    dirty, 592–593
    dumping, flat files, 594
    enterprise-wide, 33
    ETL (extraction, transformation, and load) tools, 487
    gigabytes, 5
    as graphs, 337
    historical
        customer behaviors, 5
        MBR (memory-based reasoning), 262–263
        neural networks, 219
        prediction tasks, 10
    household-level, 96
    imperfections in, 34
    inconsistent, 593–594
    as information, 22
    metadata repository, 484, 491
    missing data
        data correction, 73–74
        NULL values, 590
        splits, decision trees, 174–175
    operational feedback, 485, 492
    patterns
        meaningful discoveries, 56
        prediction, 45
        untruthful learning sources, 45–46
    point-of-sale
        association rules, 288
        scanners, 3
        as useful data source, 60
    preparation
        automatic cluster detection, 363–365
        categorical values, neural networks, 239–240
        continuous values, neural networks, 235–237
    quality, association rules, 308
    representation, genetic algorithms, 432–433
    scarce, 62
    source systems, 484, 486–487
    SQL, time series analysis, 572–573
    terabytes, 5
    truncated, 162
    useful data sources, 60–61
    visualization tools, 65
    wrong level of detail, untruthful learning sources, 47
data mining
    architecture, 528–532
    as creative process, 33
    directed
        classification, 57
        discussed, 7
        estimation, 57
        prediction, 57
    documentation, 536–537
    goals of, 7
    insourcing, 524–525
    outsourcing, 522–524
    platforms, 527
    scalability, 533–534
    scoring platforms, 527–528
    staffing, 525–526
    typical operational systems versus, 33
    undirected
        affinity grouping, 57
        clustering, 57
        discussed, 7
Data Preparation for Data Mining (Dorian Pyle), 75
The Data Warehouse Toolkit (Ralph Kimball), 474
data warehousing
    customer patterns, 5
    for decision support, 13
    discussed, 4
database administrators (DBAs), 488
databases
    call detail, 37
    demographic, 37
    KDD (knowledge discovery in databases), 8
    server platforms, affordability, 13
datasets, balanced, model sets, 68
dates and times, interval variables, 551
DBAs (database administrators), 488
deaths, household-level data, 96
debt, nonrepayment of, credit risks, 114
decision support
    data warehousing for, 13
    hypothesis testing, 50–51
    summary data, OLAP, 477–478
decision trees
    alphas, 188
    alternate representations for, 199–202
    applying to sequential events, 205
    branching nodes, 176
    building models, 8
    case study, 206, 208
    for catalog response models, 175
    classification, 9, 166–168
    cost considerations, 195
    effectiveness of, measuring, 176
    estimation, 170
    as exploration tool, 203–204
    fields, multiple, 195–197
    neural networks, 199
    profiling tasks, 12
    projective visualization, 207–208
    pruning
        C5 algorithm, 190–191
        CART algorithm, 185, 188–189
        discussed, 184
        minimum support pruning, 312
        stability-based, 191–192
    rectangular regions, 197
    regression trees, 170
    rules, extracting, 193–194
    SAS Enterprise Miner Tree Viewer tool, 167–168
    scoring, 169–170
    splits
        on categorical input variables, 174
        chi-square testing, 180–183
        discussed, 170
        diversity measures, 177–178
        entropy, 179
        finding, 172
        Gini splitting criterion, 178
        information gain ratio, 178, 180
        intrinsic information of, 180
        missing values, 174–175
        multiway, 171
        on numeric input variables, 173
        population diversity, 178
        purity measures, 177–178
        reduction in variance, 183
        surrogate, 175
    subtrees, selecting, 189
    uses for, 166
declining usage, behavior-based variables, 577–579
deep intimacy, customer relationships, 449, 451
default classes, records, 194
default risks, proof-of-concept projects, 599
degrees of freedom values, chi-square tests, 152–153
democracy approach, memory-based reasoning, 279–281
demographic databases, 37
demographic profiles, customers, 31
density
    data selection, 62–63
    density function, statistics, 133
deploying models, 84–85
derived variables, column data, 542
descriptions
    comparing values with, 65
    data transformation, 57
descriptive models, assessing, 78
descriptive profiling, 52
deviation. See standard deviation
difference of proportion
    chi-square tests versus, 153–154
    statistical analysis, 143–144
differential response analysis, marketing campaigns, 107–108
differentiation, market basket analysis, 289
dimension
    automatic cluster detection, 352
    dimension tables, OLAP, 502–503
directed clustering, automatic cluster detection, 372
directed data mining
    classification, 57
    discussed, 7
    estimation, 57
    prediction, 57
directed graphs, 330
directed models, assessing, 78–79
directed profiling, 52
dirty data, 592–593
discrete outcomes, classification, 9
discrete values, statistics, 127–131
discrimination measures, ROC curves, 99
dissociation rules, 317
distance and similarity, automatic cluster detection, 359–363
distance function
    defined, 271–272
    discussed, 258, 265
    hidden distance fields, 278
    identity distance, 271
    numeric fields, 275
    triangle inequality, 272
    zip codes, 276–277
distribution
    data exploration, 65
    one-tailed, 134
    probability and, 135
    statistics, 130–132
    two-tailed, 134
diverse data types, 536
diversity measures, splitting criteria, decision trees, 177–178
divisive clustering, automatic cluster detection, 371–372
documentation
    data mining, 536–537
    historical data as, 61
dumping data, flat files, 594

E
EBCF (existing base churn forecast), 469
economic data, useful data sources, 61
edges, graphs, 322
education level, household-level data, 96
e-mail
    as communication channel, 89
    free text resources, 556–557
encoding, inconsistent, data correction, 74
enterprise-wide data, 33
entropy, information gain, 178–180
equal-height binning, 551
equal-width binning, 551
erroneous conclusions, 74
errors
    countervailing, 81–82
    error rates
        adjusted, 185
        establishing, 79
    measurement, 159
    operational, 159
    predicting, 191
    standard error of proportion, statistical analysis, 139–141
established customers, customer relationships, 457
estimation
    accuracy, 79–81
    averages, 81
    business goals, formulating, 605
    classification tasks, 9
    collaborative filtering, 284–285
    data transformation, 57
    decision trees, 170
    directed data mining, 57
    estimation task examples, 10
    examples of, 10
    neural networks, 10, 215
    regression models, 10
    revenue, behavior-based variables, 581–583
    standard deviation, 81
    valued outcomes, 9
ETL (extraction, transformation, and load) tools, 487, 595
evaluation, automatic cluster detection, 372–373
event-based relationships, customer relationships, 458–459
existing base churn forecast (EBCF), 469
expectations
    comparing to results, 31
    expected values, chi-square tests, 150–151
    proof-of-concept projects, 599
expected churn, 118
experimentation
    hypothesis testing, 51
    statistics, 160–161
exploration tools, decision trees as, 203–204
exponential decay, retention, 389–390, 393
expressive power, descriptive models, 78
extraction, transformation, and load (ETL) tools, 487, 595

F
F tests (Ronald A. Fisher), 183–184
fax machines, link analysis, 337–341
Federal Express, transaction processing systems, 3–4
feedback
    change processes, 34
    operational, 485, 492
    relevance feedback, MBR, 267–268
feed-forward neural networks
    back propagation, 228–232
    hidden layer, 227
    input layer, 226
    output layer, 227
field values, statistics, 128
Fisher, Ronald A. (F tests), 183–184
fixed budgets, marketing campaigns, 97–100
fixed positions, genetic algorithms, 435
fixed-length character strings, 552–554
flat files, dumping data, 594
forced attrition, 118
forecasting
    EBCF (existing base churn forecast), 469
    NSF (new start forecast), 469
    survival analysis, 415–416
former customers, customer relationships, 457
forward-looking businesses, 2
fraud detection, MBR, 258
fraudulent insurance claims, classification, 9
free text response, memory-based reasoning, 285
functionality, lack of, data transformation, 28
functions
    activation, 222
    CHIDIST, 152
    combination
        attrition history, 280
        MBR (memory-based reasoning), 258, 265
        neural networks, 272
        weighted voting, 281–282
    density, 133
    distance
        defined, 271–272
        discussed, 258, 265
        hidden distance fields, 278
        identity distance, 271
        numeric fields, 275
        triangle inequality, 272
        zip codes, 276–277
    hyperbolic tangent, 223
    NORMDIST, 134
    NORMSINV, 147
    sigmoid, 225
    summation, 272
    tangent, 223
    transfer, 223
future attrition, 49
future customer behaviors, predicting, 10

G
gains, cumulative, 36, 101
Gaussian mixture model, automatic cluster detection, 366–367
gender
    as categorical value, 239
    profiling example, 12
generalized delta rules, 229
[...]...
    decision-making process, 50–51
    generating, 51
    market basket analysis, 51
    null hypothesis, statistics and, 125–126

I
IBM, relational database management software, 13
ID and key variables, 554
ID3 (Iterative Dichotomiser 3), 190
identification
    columns, 548
    customer signatures, 560–562
    good prospects, 88–89
    problem management, 43
    proof-of-concept projects, 599–601
identified versus anonymous transactions, association...

K
key and ID variables, 554
KDD (knowledge discovery in databases), 8
Kimball, Ralph (The Data Warehouse Toolkit), 474
Kleinberg algorithm, link analysis, 332–333
K-means clustering, 354–358
knowledge discovery in databases (KDD), 8
Kolmogorov-Smirnov (KS) tests, 101

L
large-business relationships, customer relationship management, 3–4
leaf nodes, classification, 167
learning opportunities, customer..

retention
    calculating, 385–386
    churn and, 116–120
    customer relationships, 467–469
    exponential decay, 389–390, 393
    hazards, 404–405
    median customer lifetime value, 387
    retention curves, 386–389
    truncated mean lifetime value, 389
retrospective customer value, 115
revenue, behavior-based variables, 581–585
revolvers, behavior-based variables, 580
RFM (recency, frequency, and monetary) value, 575
ring diagrams,...

data, time series analysis, 572–573
stability-based pruning, decision trees, 191–192
staffing, data mining, 525–526
standard deviation
    estimation, 81
    statistics, 132, 138
    variance and, 138
standard error of proportion, statistical analysis, 139–141
standardization, numeric values, 551
standardized values, statistics, 129–133
star schema structure, relational databases, 505
statistical analysis
    business...
hypothesis, statistics and, 125–126
truncated mean lifetime value, retention, 389
truthful learning sources, 48–50
two-tailed distribution, 134

U
undirected data mining
    affinity grouping, 57
    clustering, 57
    discussed, 7
uniform distribution, statistics, 132
uniform product code (UPC), 555
UNIT_MASTER file, customer signatures, 559
unordered lists, 239
unsupervised learning, 57
untruthful learning sources,...

up-selling
    customer relationships, 467
    marketing campaigns, 111, 115–116
U.S. Census Bureau Web site, 94
usage stimulation, marketing campaigns, 111
user roles, data transformation, 58–60

testing (continued)
    KS (Kolmogorov-Smirnov) tests, 101
    preclassified tests, 79
    test groups, marketing campaigns, 106
    test sets
        out of time tests, 72
        uses for, 52
time
    attributes, market basket analysis, 293
    and dates,...

    truthful sources, 48–50
    unsupervised, 57
    untruthful sources, 44–48
life stages, customer relationships, 455–456
lifetime customer value, customer relationships, 32
lift ratio
    comparing models using, 81–82
    lift charts, 82, 84
    problems with, 83
linear processes, 55
linear regression, 139
link analysis
    authorities, 333–334
    candidates, 333
    case study, 343–346
    classification, 9
    discussed, 321
    fax machines,...

affordability, 13
service business sectors, customer relationships, 13–14
shared labels, fax machines, 341
short form, census data, 94
short-term trends, 75
sigmoid activation functions, neural networks, 225
signatures, customers
    assembling, 68
    business versus residential customers, 561
    columns, pivoting, 563
    computational issues, 594–596
    considerations, 564
    customer identification, 560–562
    data for, cataloging,...
correction, 73
SKUs (stock-keeping units), 305
small-business relationships, customer relationship management, 2
SMP (symmetric multiprocessor), 485
snapshots, customer signatures, 562
social information filtering, 282
soft clustering, automatic cluster detection, 367
SOI (sphere of influence), 38
sole proprietors, 3
solicitation, marketing campaigns, 96
SOM (self-organizing map), 249–251, 372
source systems,...

graphs, 77
lists, ordered and unordered, 239
literature, market research, 22
logarithms, data transformation, 74
logical schema, OLAP, 478
logistic methods, box diagrams, 200
long form, census data, 94
long-term trends, 75
lookup tables, auxiliary information, 570–571
loyalty
    customers, 520
    loyalty programs
        marketing campaigns, 111
        welcome periods, 518
luminosity, 351

M
mailings
    marketing campaigns, 97