Furthermore, there is a general pattern of zip codes increasing from East to West. Codes that start with 0 are in New England and Puerto Rico; those beginning with 9 are on the West Coast. This suggests a distance function that approximates geographic distance by looking at the high-order digits of the zip code:

■■ d_zip(A, B) = 0.0 if the zip codes are identical
■■ d_zip(A, B) = 0.1 if the first three digits are identical (e.g., "20008" and "20015")
■■ d_zip(A, B) = 0.5 if the first digit is identical (e.g., "95050" and "98125")
■■ d_zip(A, B) = 1.0 if the first digit is not identical (e.g., "02138" and "94704")

Of course, if geographic distance were truly of interest, a better approach would be to look up the latitude and longitude of each zip code in a table and calculate the distances that way (it is possible to get this information for the United States from www.census.gov). For many purposes, however, geographic proximity is not nearly as important as some other measure of similarity. Zip codes 10011 and 10031 are both in Manhattan, but from a marketing point of view they don't have much else in common, because one is an upscale downtown neighborhood and the other is a working-class Harlem neighborhood. On the other hand, 02138 and 94704 are on opposite coasts but are likely to respond very similarly to direct mail from a political action committee, since they belong to Cambridge, MA and Berkeley, CA respectively.

This is just one example of how the choice of a distance metric depends on the data mining context. There are additional examples of distance and similarity measures in Chapter 11, where they are applied to clustering.

When a Distance Metric Already Exists

There are some situations where a distance metric already exists, but is difficult to spot. These situations generally arise in one of two forms. Sometimes, a function already exists that provides a distance measure that can be adapted for use in MBR. The news story case study provides a good example of adapting an existing function, the relevance feedback score, for use as a distance function.

Other times, there are fields that do not appear to capture distance, but can be pressed into service. An example of such a hidden distance field is solicitation history. Two customers who were chosen for a particular solicitation in the past are "close," even though the reasons why they were chosen may no longer be available; two who were not chosen are close, but not as close; and one that was chosen and one that was not are far apart. The advantage of this metric is that it can incorporate previous decisions, even if the basis for the decisions is no longer available. On the other hand, it does not work well for customers who were not around during the original solicitation, so some sort of neutral weighting must be applied to them. A code sketch of both functions from this section appears below.
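Neither distance function needs more than a few lines of code. The following is a minimal sketch, not code from the book: the function names are our own, and the 0.5 returned for customers absent from the original solicitation is one arbitrary choice for the "neutral weighting" just mentioned.

```python
def d_zip(a: str, b: str) -> float:
    """Approximate geographic distance from the high-order digits of two zip codes."""
    if a == b:
        return 0.0
    if a[:3] == b[:3]:
        return 0.1          # e.g., "20008" and "20015"
    if a[0] == b[0]:
        return 0.5          # e.g., "95050" and "98125"
    return 1.0              # e.g., "02138" and "94704"

def d_solicitation(a_chosen: bool, b_chosen: bool,
                   a_present: bool = True, b_present: bool = True) -> float:
    """Distance based on solicitation history alone (no response data yet)."""
    if not (a_present and b_present):
        return 0.5          # neutral weighting for customers not around back then
    if a_chosen and b_chosen:
        return 0.0          # both chosen: close
    if not a_chosen and not b_chosen:
        return 0.2          # neither chosen: close, but not as close
    return 1.0              # one chosen, one not: far apart
```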
Considering whether the original customers responded to the solicitation extends this function further, resulting in a solicitation metric like:

■■ d_solicitation(A, B) = 0, when A and B both responded to the solicitation
■■ d_solicitation(A, B) = 0.1, when A and B were both chosen but neither responded
■■ d_solicitation(A, B) = 0.2, when neither A nor B was chosen, but both were available in the data
■■ d_solicitation(A, B) = 0.3, when A and B were both chosen, but only one responded
■■ d_solicitation(A, B) = 0.3, when one or both were not considered
■■ d_solicitation(A, B) = 1.0, when one was chosen and the other was not

Of course, the particular values are not sacrosanct; they are only meant as a guide for measuring similarity and for showing how previous information and response histories can be incorporated into a distance function.

The Combination Function: Asking the Neighbors for the Answer

The distance function is used to determine which records comprise the neighborhood. This section presents different ways to combine data gathered from those neighbors to make a prediction. At the beginning of this chapter, we estimated the median rent in the town of Tuxedo by taking an average of the median rents in similar towns. In that example, averaging was the combination function. This section explores other methods of canvassing the neighborhood.

The Basic Approach: Democracy

One common combination function is for the k nearest neighbors to vote on an answer—"democracy" in data mining. When MBR is used for classification, each neighbor casts its vote for its own class. The proportion of votes for each class is an estimate of the probability that the new record belongs to the corresponding class. When the task is to assign a single class, it is simply the one with the most votes. When there are only two categories, an odd number of neighbors should be polled to avoid ties. As a rule of thumb, use c + 1 neighbors when there are c categories to ensure that at least one class has a plurality.

In Table 8.12, the five test cases seen earlier have been augmented with a flag that signals whether the customer has become inactive. For this example, three of the customers have become inactive and two have not, an almost balanced training set. For illustrative purposes, let's try to determine whether the new record is active or inactive by using different values of k for two distance functions, d_Euclid and d_sum (Table 8.13). The question marks indicate that no prediction has been made due to a tie among the neighbors. Notice that different values of k do affect the classification. This suggests using the percentage of neighbors in agreement to provide the level of confidence in the prediction (Table 8.14).

Table 8.12 Customers with Attrition History

RECNUM   GENDER   AGE   SALARY     INACTIVE
1        female   27    $19,000    no
2        male     51    $64,000    yes
3        male     52    $105,000   yes
4        female   33    $55,000    yes
5        male     45    $45,000    no
new      female   45    $100,000   ?

Table 8.13 Using MBR to Determine if the New Customer Will Become Inactive

           NEIGHBORS    NEIGHBORS' ATTRITION   K = 1   K = 2   K = 3   K = 4   K = 5
d_sum      4,3,5,2,1    Y,Y,N,Y,N              yes     yes     yes     yes     yes
d_Euclid   4,1,5,2,3    Y,N,N,Y,Y              yes     ?       no      ?       yes
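A minimal sketch of this voting scheme follows; it is an illustration of the idea rather than anything from the book. Fed the d_sum neighbor ordering from Table 8.13, it reproduces that row of the table and the confidence figures in Table 8.14.

```python
from collections import Counter

def knn_vote(ordered_labels, k):
    """Majority vote among the k nearest neighbors.

    ordered_labels: class labels of the training records, nearest first.
    Returns (winning class, share of votes); a tie returns None for the
    class, corresponding to the "?" entries in Table 8.13.
    """
    votes = Counter(ordered_labels[:k]).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return None, votes[0][1] / k
    return votes[0][0], votes[0][1] / k

# Attrition flags of the neighbors, nearest first, in the d_sum
# ordering (records 4, 3, 5, 2, 1):
d_sum_labels = ["yes", "yes", "no", "yes", "no"]
for k in range(1, 6):
    print(k, knn_vote(d_sum_labels, k))
# k=3 gives ('yes', 0.67), k=4 ('yes', 0.75), k=5 ('yes', 0.6), as in Table 8.14
```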
Table 8.14 Attrition Prediction with Confidence

           K = 1       K = 2       K = 3      K = 4      K = 5
d_sum      yes, 100%   yes, 100%   yes, 67%   yes, 75%   yes, 60%
d_Euclid   yes, 100%   yes, 50%    no, 67%    yes, 50%   yes, 60%

The confidence level works just as well when there are more than two categories. However, with more categories, there is a greater chance that no single category will have a majority vote. One of the key assumptions about MBR (and data mining in general) is that the training set provides sufficient information for predictive purposes. If the neighborhoods of new cases consistently produce no obvious choice of classification, then the data simply may not contain the necessary information, and the choice of dimensions and possibly of the training set needs to be reevaluated. By measuring the effectiveness of MBR on the test set, you can determine whether the training set has a sufficient number of examples.

WARNING: MBR is only as good as the training set it uses. To measure whether the training set is effective, measure the results of its predictions on the test set using two, three, and four neighbors. If the results are inconclusive or inaccurate, then the training set is not large enough or the dimensions and distance metrics chosen are not appropriate.

Weighted Voting

Weighted voting is similar to voting in the previous section, except that the neighbors are not all created equal—more like shareholder democracy than one-person, one-vote. The size of the vote is inversely proportional to the distance from the new record, so closer neighbors have stronger votes than neighbors farther away do. To prevent problems when the distance might be 0, it is common to add 1 to the distance before taking the inverse. Adding 1 also keeps all the votes between 0 and 1.

Table 8.15 applies weighted voting to the previous example. The "yes, customer will become inactive" vote is first; the "no, this is a good customer" vote is second.

Table 8.15 Attrition Prediction with Weighted Voting

           K = 1        K = 2            K = 3            K = 4            K = 5
d_sum      0.749 to 0   1.441 to 0       1.441 to 0.647   2.085 to 0.647   2.085 to 1.290
d_Euclid   0.669 to 0   0.669 to 0.562   0.669 to 1.062   1.157 to 1.062   1.601 to 1.062

Weighted voting has introduced enough variation to prevent ties. The confidence level can now be calculated as the ratio of winning votes to total votes (Table 8.16).

Table 8.16 Confidence with Weighted Voting

           K = 1       K = 2       K = 3      K = 4      K = 5
d_sum      yes, 100%   yes, 100%   yes, 69%   yes, 76%   yes, 62%
d_Euclid   yes, 100%   yes, 54%    no, 61%    yes, 52%   yes, 60%

In this case, weighting the votes has only a small effect on the results and the confidence. The effect of weighting is largest when some neighbors are considerably farther away than others. Weighting can also be applied to estimation by replacing the simple average of neighboring values with an average weighted by distance. This approach is used in collaborative filtering systems, as described in the following section.
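Before moving on, note that the mechanics of weighted voting are a one-line change from simple voting: each neighbor contributes 1/(1 + distance) instead of a full vote. A minimal sketch (the distances in the example call are hypothetical, not the ones behind Table 8.15):

```python
def weighted_vote(neighbors, k):
    """Inverse-distance weighted voting over the k nearest neighbors.

    neighbors: (distance, label) pairs, nearest first. Each neighbor's
    vote is 1 / (1 + distance): adding 1 avoids trouble at distance 0
    and keeps every vote between 0 and 1.
    """
    tallies = {}
    for distance, label in neighbors[:k]:
        tallies[label] = tallies.get(label, 0.0) + 1.0 / (1.0 + distance)
    winner = max(tallies, key=tallies.get)
    return winner, tallies[winner] / sum(tallies.values())

# Hypothetical (distance, label) pairs, nearest first:
print(weighted_vote([(0.0, "yes"), (0.5, "no"), (0.8, "yes")], k=3))
# ('yes', 0.7): roughly 1.56 "yes" votes out of 2.22 total
```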
Collaborative Filtering: A Nearest Neighbor Approach to Making Recommendations

Neither of the authors considers himself a country music fan, but one of them is the proud owner of an autographed copy of an early Dixie Chicks CD. The Chicks, who did not yet have a major record label, were performing in a local bar one day, and some friends who knew them from Texas made a very enthusiastic recommendation. The performance was truly memorable, featuring Martie Erwin's impeccable bluegrass fiddle, her sister Emily on a bewildering variety of other instruments (most, but not all, with strings), and the seductive vocals of Laura Lynch (who also played a stand-up electric bass). At the break, the band sold and autographed a self-produced CD that we still like better than the one that later won them a Grammy. What does this have to do with nearest neighbor techniques? Well, it is a human example of collaborative filtering. A recommendation from trusted friends will cause one to try something one otherwise might not try.

Collaborative filtering is a variant of memory-based reasoning particularly well suited to the application of providing personalized recommendations. A collaborative filtering system starts with a history of people's preferences. The distance function determines similarity based on overlap of preferences—people who like the same things are close. In addition, votes are weighted by distances, so the votes of closer neighbors count more for the recommendation. In other words, it is a technique for finding music, books, wine, or anything else that fits into the existing preferences of a particular person by using the judgments of a peer group selected for their similar tastes. This approach is also called social information filtering.

Collaborative filtering automates the process of using word-of-mouth to decide whether one would like something. Knowing that lots of people liked something is not enough; who liked it is also important. Everyone values some recommendations more highly than others. The recommendation of a close friend whose past recommendations have been right on target may be enough to get you to go see a new movie even if it is in a genre you generally dislike. On the other hand, an enthusiastic recommendation from a friend who thinks Ace Ventura: Pet Detective is the funniest movie ever made might serve to warn you off one you might otherwise have gone to see.

Preparing recommendations for a new customer using an automated collaborative filtering system has three steps:

1. Build a customer profile by getting the new customer to rate a selection of items such as movies, songs, or restaurants.

2. Compare the new customer's profile with the profiles of other customers using some measure of similarity.

3. Use some combination of the ratings of customers with similar profiles to predict the rating that the new customer would give to items he or she has not yet rated.

The following sections examine each of these steps in a bit more detail.

Building Profiles

One challenge with collaborative filtering is that there are often far more items to be rated than any one person is likely to have experienced or be willing to rate. That is, profiles are usually sparse, meaning that there is little overlap among the users' preferences for making recommendations. Think of a user profile as a vector with one element per item in the universe of items to be rated. Each element of the vector represents the profile owner's rating for the corresponding item on a scale of –5 to 5, with 0 indicating neutrality and null values for no opinion. If there are thousands or tens of thousands of elements in the vector and each customer decides which ones to rate, any two customers' profiles are likely to end up with few overlaps.
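In code, such sparse profiles are stored more naturally as mappings from item to rating than as huge, mostly null vectors. A minimal sketch with hypothetical ratings (the movie titles are borrowed from Figure 8.7; the function name is ours):

```python
# Only rated items are stored; a missing key means "no opinion" (null).
simon  = {"Crouching Tiger": -1, "Apocalypse Now": 4, "Plan 9 From Outer Space": 2}
amelia = {"Crouching Tiger": -4, "American Pie 2": 1}

def overlap(profile_a: dict, profile_b: dict) -> set:
    """The items rated by both customers: all a distance function can compare."""
    return set(profile_a) & set(profile_b)

print(overlap(simon, amelia))   # {'Crouching Tiger'}: very little to go on
```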
On the other hand, forcing customers to rate a particular subset may miss interesting information, because ratings of more obscure items may say more about the customer than ratings of common ones. A fondness for the Beatles is less revealing than a fondness for Mose Allison. A reasonable approach is to have new customers rate a list of the twenty or so most frequently rated items (a list that might change over time) and then free them to rate as many additional items as they please.

Comparing Profiles

Once a customer profile has been built, the next step is to measure its distance from other profiles. The most obvious approach would be to treat the profile vectors as geometric points and calculate the Euclidean distance between them, but many other distance measures have been tried. Some give higher weight to agreement when users give a positive rating, especially when most users give negative ratings to most items. Still others apply statistical correlation tests to the ratings vectors.

Making Predictions

The final step is to use some combination of nearby profiles in order to come up with estimated ratings for the items that the customer has not rated. One approach is to take a weighted average where the weight is inversely proportional to the distance. The example shown in Figure 8.7 illustrates estimating the rating that Nathaniel would give to Planet of the Apes based on the opinions of his neighbors, Simon and Amelia.

[Figure 8.7: Profile vectors for Nathaniel and his neighbors (Peter, Jenny, Alan, Michael, Stephanie, Amelia, and Simon), showing ratings for movies such as Crouching Tiger, Osmosis Jones, Apocalypse Now, Vertical Ray of the Sun, Planet of the Apes, American Pie 2, and Plan 9 From Outer Space. The predicted rating for Planet of the Apes is –2.]

Simon, who is distance 2 away, gave that movie a rating of –1. Amelia, who is distance 4 away, gave that movie a rating of –4. No one else's profile is close enough to Nathaniel's to be included in the vote. Because Amelia is twice as far away as Simon, her vote counts only half as much as his. The estimate for Nathaniel's rating is weighted by the distance: (½(–1) + ¼(–4)) / (½ + ¼) = –1.5/0.75 = –2.

A good collaborative filtering system gives its users a chance to comment on the predictions and adjust the profile accordingly. In this example, if Nathaniel rents the video of Planet of the Apes despite the prediction that he will not like it, he can then enter an actual rating of his own. If it turns out that he really likes the movie and gives it a rating of 4, his new profile will be in a slightly different neighborhood, and Simon's and Amelia's opinions will count less for Nathaniel's next recommendation.
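That weighted-average arithmetic is easy to check in code. A minimal sketch (the function name is ours; note that the voting variant described earlier adds 1 to the distance to guard against zero distances, which these nonzero distances do not need):

```python
def predict_rating(neighbor_opinions):
    """Distance-weighted average of neighbors' ratings for one item.

    neighbor_opinions: (distance, rating) pairs; weights are inversely
    proportional to distance (all distances here are assumed nonzero).
    """
    weights = [(1.0 / distance, rating) for distance, rating in neighbor_opinions]
    return sum(w * r for w, r in weights) / sum(w for w, _ in weights)

# Simon is distance 2 away and rated the movie -1;
# Amelia is distance 4 away and rated it -4:
print(predict_rating([(2, -1), (4, -4)]))   # -2.0, matching the text
```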
Lessons Learned

Memory-based reasoning is a powerful data mining technique that can be used to solve a wide variety of data mining problems involving classification or estimation. Unlike other data mining techniques that use a training set of preclassified data to create a model and then discard the training set, for MBR the training set essentially is the model.

Choosing the right training set is perhaps the most important step in MBR. The training set needs to include sufficient numbers of examples of all possible classifications. This may mean enriching it by including a disproportionate number of instances for rare classifications in order to create a balanced training set with roughly the same number of instances for all categories. A training set that includes only instances of bad customers will predict that all customers are bad. In general, the training set should have at least thousands, if not hundreds of thousands or millions, of examples.

MBR is a k-nearest neighbors approach. Determining which neighbors are near requires a distance function, and there are many approaches to measuring the distance between two records. The careful choice of an appropriate distance function is a critical step in using MBR. The chapter introduced an approach to creating an overall distance function by building a distance function for each field and normalizing it. The normalized field distances can then be combined in a Euclidean fashion or summed to produce a Manhattan distance (a code sketch contrasting the two appears at the end of this section). When the Euclidean method is used, a large difference in any one field is enough to cause two records to be considered far apart. The Manhattan method is more forgiving—a large difference on one field can more easily be offset by close values on other fields. A validation set can be used to pick the best distance function for a given model set by applying all candidates to see which produces better results. Sometimes, the right choice of neighbors depends on modifying the distance function to favor some fields over others. This is easily accomplished by incorporating weights into the distance function.

The next question is the number of neighbors to choose. Once again, investigating different numbers of neighbors using the validation set can help determine the optimal number. There is no right number of neighbors; the number depends on the distribution of the data and is highly dependent on the problem being solved.

The basic combination function, weighted voting, does a good job for categorical data, using weights inversely proportional to distance. The analogous operation for estimating numeric values is a weighted average.

One good application for memory-based reasoning is making recommendations. Collaborative filtering is an approach to making recommendations that works by grouping people with similar tastes together, using a distance function that can compare two lists of user-supplied ratings. Recommendations for a new person are calculated using a weighted average of the ratings of his or her nearest neighbors.
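Here is the promised contrast between the two ways of combining normalized field distances. It is a minimal sketch; the field distances in the example are hypothetical:

```python
import math

def manhattan(field_distances):
    """Sum of normalized per-field distances: a large difference on one
    field can be offset by close values on the others."""
    return sum(field_distances)

def euclidean(field_distances):
    """Euclidean combination: one large field difference is enough to
    push two records far apart."""
    return math.sqrt(sum(d * d for d in field_distances))

spread = [0.33, 0.33, 0.34]   # moderate differences on every field
spiked = [0.90, 0.05, 0.05]   # one large difference, the rest tiny
print(manhattan(spread), manhattan(spiked))   # 1.0 and 1.0: indistinguishable
print(euclidean(spread), euclidean(spiked))   # ~0.58 and ~0.90: the spike dominates
```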
CHAPTER 9
Market Basket Analysis and Association Rules

To convey the fundamental ideas of market basket analysis, start with the image of the shopping cart in Figure 9.1, filled with various products purchased by someone on a quick trip to the supermarket. This basket contains an assortment of products—orange juice, bananas, soft drink, window cleaner, and detergent. One basket tells us about what one customer purchased at one time. A complete list of purchases made by all customers provides much more information; it describes the most important part of a retailing business—what merchandise customers are buying and when.

Each customer purchases a different set of products, in different quantities, at different times. Market basket analysis uses the information about what customers purchase to provide insight into who they are and why they make certain purchases. Market basket analysis provides insight into the merchandise by telling us which products tend to be purchased together and which are most amenable to promotion. This information is actionable: it can suggest new store layouts; it can determine which products to put on special; it can indicate when to issue coupons; and so on. When this data can be tied to individual customers through a loyalty card or Web site registration, it becomes even more valuable.

The data mining technique most closely allied with market basket analysis is the automatic generation of association rules. Association rules represent patterns in the data without a specified target. As such, they are an example of undirected data mining. Whether the patterns make sense is left to human interpretation.

[...] Three Items and Their Combinations

COMBINATION       PROBABILITY
A                 45.0%
B                 42.5%
C                 40.0%
A and B           25.0%
A and C           20.0%
B and C           15.0%
A and B and C      5.0%

Table 9.6 Confidence in Rules

RULE                  P(CONDITION)   P(CONDITION AND RESULT)   CONFIDENCE
If A and B then C     25%            5%                        0.20
If A and C then B     20%            5%                        0.25
If B and C then A     15%            5%                        0.33

What is confidence really saying? Saying that the rule "if B and C then [...]

[...] found after analyzing hundreds of thousands of point-of-sale transactions from Sears. Although it is valid and well supported in the data, it is still useless. Similar results abound: people who buy 2-by-4s also purchase nails; customers who purchase paint buy paint brushes; oil and oil filters are purchased together, as are hamburgers and hamburger buns, and charcoal and lighter fluid. A subtler problem [...]

[...] redeemed, and the amount of change. With market basket analysis, even this limited data can yield interesting and actionable results. The increasing prevalence of Web transactions, loyalty programs, and purchasing clubs is resulting in more and more identified transactions, providing analysts with more possibilities for information about customers and their behavior over time. Demographic and trending [...]

[...] interesting in itself, as in the Barbie doll and candy bar example. But in other circumstances, it makes more sense to find an underlying rule of the form: if condition, then result. Notice that this is just shorthand. If the rule says, "if Barbie doll, then candy bar," then we read it as: "if a customer purchases a Barbie doll, then the customer is also expected to purchase a candy bar." The general practice is to [...]

[...] ensuring that customers must walk through candy aisles on their way back from Barbie-land. It might suggest product tie-ins and promotions offering candy bars and dolls together. It might suggest particular ways to advertise the products. Because the rule is easily understood, it suggests plausible causes and possible interventions.

Trivial Rules

Trivial results [...]

[Figure 9.1: A shopping cart containing orange juice, bananas, window cleaner, and a six pack of soda, annotated with questions such as: How do the demographics of the neighborhood affect what customers buy? Is soda typically purchased with bananas? Does the brand of soda make a difference? What should be in the basket but is not? Are window cleaning products purchased when detergent and orange juice are bought together? Caption: Market basket analysis helps you understand customers as well [...]]
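The confidence column in Table 9.6 is just the ratio of the two probability columns. A minimal sketch, reading the joint probabilities straight from the combinations table above (the function and data layout are ours, not the book's):

```python
# Probability that a basket contains each item set, from the table above.
support = {
    frozenset("A"): 0.450, frozenset("B"): 0.425, frozenset("C"): 0.400,
    frozenset("AB"): 0.250, frozenset("AC"): 0.200, frozenset("BC"): 0.150,
    frozenset("ABC"): 0.050,
}

def confidence(condition, result):
    """P(condition and result) / P(condition)."""
    cond = frozenset(condition)
    return support[cond | frozenset(result)] / support[cond]

print(round(confidence("AB", "C"), 2))   # 0.2
print(round(confidence("AC", "B"), 2))   # 0.25
print(round(confidence("BC", "A"), 2))   # 0.33
```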
[...] cars and hotel rooms, provide insight into the next product that customers are likely to purchase.

■■ Optional services purchased by telecommunications customers (call waiting, call forwarding, DSL, speed call, and so on) help determine how to bundle these services together to maximize revenue.

■■ Banking services used by retail customers (money market accounts, CDs, investment services, car loans, and [...]

[...] limits imposed by thousands or tens of thousands of items. The next three sections delve into these concerns in more detail.

1. First determine the right set of items and the right level. For instance, is pizza an item, or are the toppings items?

2. Next, calculate the probabilities and joint probabilities of items and combinations of [...]

[...] flour as well as prepackaged cake mixes. Such customers might be identified from the frequency of their purchases of flour, baking powder, and similar ingredients; the proportion of such purchases to the customer's total spending; and the lack of interest in prepackaged mixes and ready-to-eat desserts. Of course, such ingredients may be purchased at different times and in different quantities, making it necessary [...]

[...] differentiating among customers; good customers clearly order more often than not-so-good customers. Figure 9.3 attempts to look at the breadth of the customer relationship (the number of unique items ever purchased) by the depth of the relationship (the number of orders) for customers who purchased more than one item. This data is from a small specialty retailer. The biggest bubble shows that many customers who [...]
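The probability-calculation step in the numbered list above is straightforward to sketch: count how many baskets contain each item and each pair of items. This is a toy illustration with basket contents borrowed from Figure 9.1; the function is ours, not the book's:

```python
from collections import Counter
from itertools import combinations

def item_and_pair_probabilities(baskets):
    """Probabilities and joint probabilities of items, estimated as the
    fraction of baskets containing each item or pair of items."""
    n = len(baskets)
    items, pairs = Counter(), Counter()
    for basket in baskets:
        unique = sorted(set(basket))        # ignore quantities within a basket
        items.update(unique)
        pairs.update(combinations(unique, 2))
    return ({i: c / n for i, c in items.items()},
            {p: c / n for p, c in pairs.items()})

baskets = [{"orange juice", "bananas", "soda"},
           {"soda", "window cleaner"},
           {"orange juice", "bananas", "detergent"}]
p_item, p_pair = item_and_pair_probabilities(baskets)
print(p_item["soda"])                        # 2/3 of these baskets
print(p_pair[("bananas", "orange juice")])   # 2/3 as well
```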