470643 c09.qxd 3/8/04 11:15 AM Page 312
Chapter 9

For instance, in the grocery store that sells orange juice, milk, detergent, soda, and window cleaner, the first step calculates the counts for each of these items. During the second step, the following counts are created:

■■ Orange juice and milk, orange juice and detergent, orange juice and soda, orange juice and cleaner
■■ Milk and detergent, milk and soda, milk and cleaner
■■ Detergent and soda, detergent and cleaner
■■ Soda and cleaner

This is a total of 10 pairs of items. The third pass takes all combinations of three items and so on. Of course, each of these stages may require a separate pass through the data, or multiple stages can be combined into a single pass by considering different numbers of combinations at the same time.

Although it is not obvious when there are just five items, increasing the number of items in the combinations requires exponentially more computation. This results in exponentially growing run times, and long, long waits when considering combinations with more than three or four items. The solution is pruning. Pruning is a technique for reducing the number of items and combinations of items being considered at each step. At each stage, the algorithm throws out the combinations that do not meet some threshold criterion.

The most common pruning threshold is called minimum support pruning. Support refers to the number of transactions in the database where the rule holds. Minimum support pruning requires that a rule hold on a minimum number of transactions. For instance, if there are one million transactions and the minimum support is 1 percent, then only rules supported by at least 10,000 transactions are of interest. This makes sense, because the purpose of generating these rules is to pursue some sort of action (such as striking a deal with Mattel, the makers of Barbie dolls, to make a candy-bar-eating doll), and the action must affect enough transactions to be worthwhile.

The minimum support constraint has a cascading effect.
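The counting passes and minimum support pruning described above can be sketched roughly as follows. This is a minimal illustration, not the book's implementation: the function name `frequent_itemsets`, the parameters `min_count` and `max_size`, and the five sample baskets are all assumptions made up for the example. The sketch keeps a combination as a candidate only if every smaller combination inside it survived the previous pass, since a combination cannot occur more often than any of its parts.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_count, max_size=3):
    """Count item combinations pass by pass, pruning those below min_count.

    min_count is an absolute number of transactions (for example, 1 percent
    of one million transactions would be 10,000).
    """
    baskets = [frozenset(t) for t in transactions]
    frequent = {}

    # Pass 1: count individual items and prune the rare ones.
    counts = Counter(frozenset([item]) for basket in baskets for item in basket)
    survivors = {s for s, n in counts.items() if n >= min_count}
    frequent.update((s, counts[s]) for s in survivors)

    size = 2
    while survivors and size <= max_size:
        # Only items that still belong to a surviving combination are considered.
        kept_items = frozenset().union(*survivors)
        counts = Counter()
        for basket in baskets:
            for combo in combinations(sorted(basket & kept_items), size):
                # Skip the combination unless all of its smaller
                # sub-combinations met the threshold in the previous pass.
                if all(frozenset(sub) in survivors
                       for sub in combinations(combo, size - 1)):
                    counts[frozenset(combo)] += 1
        survivors = {s for s, n in counts.items() if n >= min_count}
        frequent.update((s, counts[s]) for s in survivors)
        size += 1
    return frequent


# Illustrative baskets (invented for this sketch, not data from the text).
baskets = [
    ["orange juice", "soda"],
    ["milk", "orange juice", "window cleaner"],
    ["orange juice", "detergent"],
    ["orange juice", "detergent", "soda"],
    ["window cleaner", "soda"],
]
result = frequent_itemsets(baskets, min_count=2)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With a threshold of two baskets, milk is dropped in the first pass, so no pair or triple containing milk is ever counted in later passes.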
Consider a rule with four items in it: if A, B, and C, then D. Using minimum support pruning, this rule has to be true on at least 10,000 transactions in the data. It follows that:

A must appear in at least 10,000 transactions, and,
B must appear in at least 10,000 transactions, and,
C must appear in at least 10,000 transactions, and,
D must appear in at least 10,000 transactions.

In other words, minimum support pruning eliminates items that do not appear in enough transactions. The threshold criterion applies to each step in the algorithm. The minimum threshold also implies that:

A and B must appear together in at least 10,000 transactions, and,
A and C must appear together in at least 10,000 transactions, and,
A and D must appear together in at least 10,000 transactions, and so on.

Each step of the calculation of the co-occurrence table can eliminate combinations of items that do not meet the threshold, reducing its size and the number of combinations to consider during the next pass.

Figure 9.11 is an example of how the calculation takes place. In this example, choosing a minimum support level of 10 percent would eliminate all the combinations with three items, and their associated rules, from consideration. This is an example where pruning does not have an effect on the best rule, since the best rule has only two items. In the case of pizza, these toppings are all fairly common, so none is pruned individually. If anchovies were included in the analysis, and only 15 of the 2,000 pizzas contained them, then a minimum support of 10 percent, or even 1 percent, would eliminate anchovies during the first pass.

The best choice for minimum support depends on the data and the situation. It is also possible to vary the minimum support as the algorithm progresses.
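The Figure 9.11 calculation referred to above can be reproduced with a short sketch. The counts are the ones given in the figure; the names `exact`, `count`, and `rule_metrics` are hypothetical helpers introduced only for this illustration, and lift is computed as confidence divided by the overall frequency of the "then" item.

```python
TOTAL = 2000

# Pizzas sold, keyed by the exact set of toppings on the pizza (Figure 9.11).
exact = {
    frozenset(): 550,
    frozenset({"mushroom"}): 100,
    frozenset({"pepperoni"}): 150,
    frozenset({"extra cheese"}): 200,
    frozenset({"mushroom", "pepperoni"}): 400,
    frozenset({"mushroom", "extra cheese"}): 300,
    frozenset({"pepperoni", "extra cheese"}): 200,
    frozenset({"mushroom", "pepperoni", "extra cheese"}): 100,
}


def count(items):
    """Number of pizzas that contain at least the given toppings."""
    items = frozenset(items)
    return sum(n for toppings, n in exact.items() if items <= toppings)


def rule_metrics(condition, result):
    """Return (support, confidence, lift) for the rule: if condition, then result."""
    both = count(set(condition) | set(result))
    support = both / TOTAL
    confidence = both / count(condition)
    lift = confidence / (count(result) / TOTAL)
    return support, confidence, lift


# Three-item rule: if mushroom and pepperoni, then extra cheese.
print(rule_metrics({"mushroom", "pepperoni"}, {"extra cheese"}))

# Two-item rule: if pepperoni, then mushroom.
print(rule_metrics({"pepperoni"}, {"mushroom"}))

# A 10 percent minimum support removes every three-item combination.
print(count({"mushroom", "pepperoni", "extra cheese"}) / TOTAL < 0.10)
```

The three-item rule comes out with support 0.05, confidence 0.2, and lift 0.5, while the two-item rule has a lift of about 1.31, which is why pruning at 10 percent support does not affect the best rule here.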
For instance, using different support levels at different stages, you can find uncommon combinations of common items (by decreasing the support level for successive steps) or relatively common combinations of uncommon items (by increasing the support level).

The Problem of Big Data

A typical fast food restaurant offers several dozen items on its menu, say 100. To use probabilities to generate association rules, counts have to be calculated for each combination of items. The number of combinations of a given size tends to grow exponentially. A combination with three items might be a small fries, cheeseburger, and medium Diet Coke. On a menu with 100 items, how many combinations are there with three different menu items? There are 161,700! This calculation is based on the binomial formula:

$\binom{100}{3} = \frac{100!}{3!\,(100-3)!} = \frac{100 \times 99 \times 98}{3 \times 2 \times 1} = 161{,}700$

On the other hand, a typical supermarket has at least 10,000 different items in stock, and more typically 20,000 or 30,000.

Calculating the support, confidence, and lift quickly gets out of hand as the number of items in the combinations grows. With 10,000 items, there are almost 50 million possible combinations of two items in the grocery store and over 100 billion combinations of three items. Although computers are getting more powerful and cheaper, it is still very time-consuming to calculate the counts for this number of combinations. Calculating the counts for five or more items is prohibitively expensive. The use of product hierarchies reduces the number of items to a manageable size.

The number of transactions is also very large. In the course of a year, a decent-size chain of supermarkets will generate tens or hundreds of millions of transactions. Each of these transactions consists of one or more items, often several dozen at a time. So, determining if a particular combination of items is present in a particular transaction may require a bit of effort, multiplied a million-fold for all the transactions.

Figure 9.11 This example shows how to count up the frequencies on pizza sales for market basket analysis.

A pizza restaurant has sold 2000 pizzas, of which:
100 are mushroom only, 150 are pepperoni only, 200 are extra cheese only
400 are mushroom and pepperoni, 300 are mushroom and extra cheese, 200 are pepperoni and extra cheese
100 are mushroom, pepperoni, and extra cheese
550 have no extra toppings

We need to calculate the probabilities for all possible combinations of items:
Mushroom: 100 + 400 + 300 + 100 = 900 pizzas or 45%
Pepperoni: 150 + 400 + 200 + 100 = 850 pizzas or 42.5%
Extra cheese: 200 + 300 + 200 + 100 = 800 pizzas or 40%
Mushroom and pepperoni: 400 + 100 = 500 pizzas or 25%
Mushroom and extra cheese: 300 + 100 = 400 pizzas or 20%
Pepperoni and extra cheese: 200 + 100 = 300 pizzas or 15%
Mushroom, pepperoni, and extra cheese: 100 pizzas or 5%

There are three rules with all three items:
If mushroom and pepperoni, then extra cheese: Support = 5%; Confidence = 5% divided by 25% = 0.2; Lift = 20% (100/500) divided by 40% (800/2000) = 0.5
If mushroom and extra cheese, then pepperoni: Support = 5%; Confidence = 5% divided by 20% = 0.25; Lift = 25% (100/400) divided by 42.5% (850/2000) = 0.588
If pepperoni and extra cheese, then mushroom: Support = 5%; Confidence = 5% divided by 15% = 0.333; Lift = 33.3% (100/300) divided by 45% (900/2000) = 0.74

The best rule has only two items:
If pepperoni, then mushroom: Support = 25%; Confidence = 25% divided by 42.5% = 0.588; Lift = 58.8% (500/850) divided by 45% (900/2000) = 1.31

Extending the Ideas

The basic ideas of association rules can be applied to different areas, such as comparing different stores, and the definition of the rules can be enhanced. These extensions are discussed in this section.
Using Association Rules to Compare Stores

Market basket analysis is commonly used to make comparisons between locations within a single chain. The rule about toilet bowl cleaner sales in hardware stores is an example where sales at new stores are compared to sales at existing stores. Different stores exhibit different selling patterns for many reasons: regional trends, the effectiveness of management, dissimilar advertising, and varying demographic patterns in the catchment area, for example. Air conditioners and fans are often purchased during heat waves, but heat waves affect only a limited region. Within smaller areas, demographics of the catchment area can have a large impact; we would expect stores in wealthy areas to exhibit different sales patterns from those in poorer neighborhoods. In these cases, market basket analysis can help to describe the differences, and they serve as examples of using market basket analysis for directed data mining.

How can association rules be used to make these comparisons? The first step is augmenting the transactions with virtual items that specify which group, such as an existing location or a new location, the transaction comes from. Virtual items help describe the transaction, although the virtual item is not a product or service. For instance, a sale at an existing hardware store might include the following products:

■■ A hammer
■■ A box of nails
■■ Extra-fine sandpaper

TIP Adding virtual items to the market basket data makes it possible to find rules that include store characteristics and customer characteristics.

After augmenting the data to specify where it came from, the transaction looks like:

a hammer, a box of nails, extra-fine sandpaper, “at existing hardware store.”

To compare sales at store openings versus existing stores, the process is:

1. Gather data for a specific period (such as 2 weeks) from store openings.
   Augment each of the transactions in this data with a virtual item saying that the transaction is from a store opening.
2. Gather about the same amount of data from existing stores. Here you might use a sample across all existing stores, or you might take all the data from stores in comparable locations. Augment the transactions in this data with a virtual item saying that the transaction is from an existing store.
3. Apply market basket analysis to find association rules in each set.
4. Pay particular attention to association rules containing the virtual items.

Because association rules are undirected data mining, the rules act as starting points for further hypothesis testing. Why does one pattern exist at existing stores and another at new stores? The rule about toilet bowl cleaners and store openings, for instance, suggests looking more closely at toilet bowl cleaner sales in existing stores at different times during the year.

Using this technique, market basket analysis can be used for many other types of comparisons:

■■ Sales during promotions versus sales at other times
■■ Sales in various geographic areas, by county, standard statistical metropolitan area (SSMA), direct marketing area (DMA), or country
■■ Urban versus suburban sales
■■ Seasonal differences in sales patterns

Adding virtual items to each basket of goods enables the standard association rule techniques to make these comparisons.

Dissociation Rules

A dissociation rule is similar to an association rule except that it can have the connector “and not” in the condition in addition to “and.” A typical dissociation rule looks like:

if A and not B, then C.

Dissociation rules can be generated by a simple adaptation of the basic market basket analysis algorithm. The adaptation is to introduce a new set of items that are the inverses of each of the original items.
Then, modify each transaction so it includes an inverse item if, and only if, it does not contain the original item. For example, Table 9.8 shows the transformation of a few transactions. The ¬ before the item denotes the inverse item.

There are three downsides to including these new items. First, the total number of items used in the analysis doubles. Since the amount of computation grows exponentially with the number of items, doubling the number of items seriously degrades performance. Second, the size of a typical transaction grows because it now includes inverted items. The third issue is that the frequency of the inverse items tends to be much larger than the frequency of the original items. So, minimum support constraints tend to produce rules in which all items are inverted, such as:

if NOT A and NOT B then NOT C.

These rules are less likely to be actionable.

Sometimes it is useful to invert only the most frequent items in the set used for analysis. This is particularly valuable when the frequency of some of the original items is close to 50 percent, so the frequencies of their inverses are also close to 50 percent.

Table 9.8 Transformation of Transactions to Generate Dissociation Rules

CUSTOMER  ITEMS      CUSTOMER  WITH INVERSE ITEMS
1         {A, B, C}  1         {A, B, C}
2         {A}        2         {A, ¬B, ¬C}
3         {A, C}     3         {A, ¬B, C}
4         {A}        4         {A, ¬B, ¬C}
5         {}         5         {¬A, ¬B, ¬C}

Sequential Analysis Using Association Rules

Association rules find things that happen at the same time: what items are purchased at a given time. The next natural question concerns sequences of events and what they mean. Examples of results in this area are:

■■ New homeowners purchase shower curtains before purchasing furniture.
■■ Customers who purchase new lawnmowers are very likely to purchase a new garden hose in the following 6 weeks.
■■ When a customer goes into a bank branch and asks for an account reconciliation, there is a good chance that he or she will close all his or her accounts.

Time-series data usually requires some way of identifying the customer over time. Anonymous transactions cannot reveal that new homeowners buy shower curtains before they buy furniture. This requires tracking each customer, as well as knowing which customers recently purchased a home. Since larger purchases are often made with credit cards or debit cards, this is less of a problem. For problems in other domains, such as investigating the effects of medical treatments or customer behavior inside a bank, all transactions typically include identity information.

WARNING In order to consider time-series analyses on your customers, there has to be some way of identifying customers. Without a way of tracking individual customers, there is no way to analyze their behavior over time.

For the purposes of this section, a time series is an ordered sequence of items. It differs from a transaction only in being ordered. In general, the time series contains identifying information about the customer, since this information is used to tie the different transactions together into a series. Although there are many techniques for analyzing time series, such as ARIMA (a statistical technique) and neural networks, this section discusses only how to manipulate the time-series data to apply market basket analysis.
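As a rough sketch of the kind of manipulation meant here, the following code ties purchases together by customer, orders them in time, and counts how many customers bought one item before another. The record layout `(customer_id, timestamp, item)`, the ISO-formatted date strings, the function names, and the sample purchases are all assumptions made for illustration; they are not taken from the text.

```python
from collections import Counter, defaultdict
from itertools import combinations


def ordered_series(purchases):
    """Group (customer_id, timestamp, item) records into one time-ordered
    list of items per customer."""
    by_customer = defaultdict(list)
    for customer_id, timestamp, item in purchases:
        by_customer[customer_id].append((timestamp, item))
    # Sorting the (timestamp, item) pairs puts each customer's items in time order.
    return {cid: [item for _, item in sorted(events)]
            for cid, events in by_customer.items()}


def before_after_counts(series):
    """For each ordered pair (earlier, later), count the customers whose
    series contains `earlier` somewhere before `later`."""
    counts = Counter()
    for items in series.values():
        seen = set()
        # combinations() preserves list order, so the first element of each
        # pair always occurs earlier in the series than the second.
        for earlier, later in combinations(items, 2):
            if earlier != later:
                seen.add((earlier, later))
        counts.update(seen)  # each customer counts at most once per pair
    return counts


# Invented sample records, deliberately not in time order.
purchases = [
    ("c1", "2004-02-10", "furniture"),
    ("c1", "2004-01-05", "shower curtain"),
    ("c2", "2004-03-01", "shower curtain"),
    ("c2", "2004-03-20", "furniture"),
    ("c3", "2004-01-15", "furniture"),
    ("c3", "2004-04-02", "shower curtain"),
]
series = ordered_series(purchases)
counts = before_after_counts(series)
print(counts[("shower curtain", "furniture")], counts[("furniture", "shower curtain")])
```

Without the customer identifier there would be nothing to group on, which is the point of the warning above: anonymous transactions cannot be assembled into a series.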
In order to use time series, the transaction data must have two additional features:

■■ A timestamp or sequencing information to determine when transactions occurred relative to each other
■■ Identifying information, such as account number, household ID, or customer ID, that identifies different transactions as belonging to the same customer or household (sometimes called an economic marketing unit)

Building sequential rules is similar to the process of building association rules:

1. All items purchased by a customer are treated as a single order, and each item retains the timestamp indicating when it was purchased.
2. Groups of items that appear together are found by the same process as before.
3. To develop the rules, only rules where the items on the left-hand side were purchased before items on the right-hand side are considered.

The result is a set of association rules that can reveal sequential patterns.

Lessons Learned

Market basket data describes what customers purchase. Analyzing this data is complex, and no single technique is powerful enough to provide all the answers. The data itself typically describes the market basket at three different levels. The order is the event of the purchase, the line items are the items in the purchase, and the customer connects orders together over time.

Many important questions about customer behavior can be answered by looking at product sales over time. Which are the best selling items? Which items that sold well last year are no longer selling well this year? Inventory curves do not require transaction-level data. Perhaps the most important insight they provide is the effect of marketing interventions: did sales go up or down after a particular event?

However, inventory curves are not sufficient for understanding relationships among items in a single basket. One technique that is quite powerful is association rules.
This technique finds products that tend to sell together in groups. Sometimes the groups are sufficient for insight. Other times, the groups are turned into explicit rules: when certain items are present, we expect to find certain other items in the basket.

There are three measures of association rules. Support tells how often the rule is found in the transaction data. Confidence says how often the “then” part is true when the “if” part is true. And lift tells how much better the rule is at predicting the “then” part as compared to having no rule at all.

The rules so generated fall into three categories. Useful rules explain a relationship that was perhaps unexpected. Trivial rules explain relationships that are known (or should be known) to exist. And inexplicable rules simply do not make sense. Inexplicable rules often have weak support.

Market basket analysis and association rules provide ways to analyze item-level detail, where the relationships between items are determined by the baskets they fall into. In the next chapter, we’ll turn to link analysis, which generalizes the ideas of “items” linked by “relationships,” using the background of an area of mathematics called graph theory.

CHAPTER 10
Link Analysis

The international route maps of British Airways and Air France offer more than just trip planning help. They also provide insights into the history and politics of their respective homelands and of lost empires. A traveler bound from New York to Mombasa changes planes at Heathrow; one bound for Abidjan changes at Charles de Gaulle. The international route maps show how much information can be gained from knowing how things are connected.

Which Web sites link to which other ones? Who calls whom on the telephone? Which physicians prescribe which drugs to which patients?
These relationships are all visible in data, and they all contain a wealth of information that most data mining techniques are not able to take direct advantage of. In our ever-more-connected world (where, it has been claimed, there are no more than six degrees of separation between any two people on the planet), understanding relationships and connections is critical. Link analysis is the data mining technique that addresses this need.

Link analysis is based on a branch of mathematics called graph theory. This chapter reviews the key notions of graphs, then shows how link analysis has been applied to solve real problems. Link analysis is not applicable to all types of data nor can it solve all types of problems. However, when it can be used, it [...]

... demographic data for prospective customers. They could also distinguish between individual customers and business accounts. In addition to MOU, though, their only understanding of customer behavior was the total amount billed and whether customers paid the bills ...

1 The authors would like to thank their colleagues Alan Parker, William Crowder, and Ravi Basawi for their contributions to this section.

... work-at-home customers, because these customers could have been paying higher business rates instead of lower residential rates. Far from targeting such customers for marketing campaigns, the local telephone providers would deny such customers residential rates, punishing them for behaving like a small business. For this company, developing and selling work-at-home packages represented a new foray into customer ...
For this example, we chose two profitable customers considered similar by previous segmentation techniques. Link analysis showed their specific calling patterns and suggested how the customers differ. On the other hand, looking at the call patterns for all customers at the same time would require drawing a graph with hundreds of thousands or millions of nodes and hundreds of millions of edges ...

... precedes node B in some path that contains both A and B, then A will precede B in all paths containing both A and B (otherwise there would be a cycle). In this case, we say that A is a predecessor of B and that B is a successor of A. If no paths contain both A and B, then A and B are disjoint. This strict ordering can be an important property of the nodes and is sometimes useful for data mining purposes ...

... of Two Customers

Figure 10.11 illustrates two customers and their calling patterns during a typical month. These two customers have similar MOU, yet the patterns are strikingly different. John’s calls generate a small, tight graph, while Jane’s explodes with many different calls. If Jane is happy with her wireless service, her use will likely grow and she might even influence many of her friends and colleagues ...

... number of minutes each month that a customer uses on the cellular phone. MOU is a useful measure, since there is a direct correlation between MOU and the amount billed to the customer each month. This correlation is not exact, since it does not take into account discount periods and calling plans that offer free nights and weekends, but it is a good guide nonetheless. The marketing group also had external ...
    ... call_detail
    WHERE terminating_number IN (SELECT number FROM dedicated_fax)
      AND duration > 9
      AND originating_number NOT IN (SELECT number FROM voice_numbers)
    GROUP BY originating_number;

and for shared lines it is:

    SELECT originating_number
    FROM call_detail
    WHERE terminating_number IN (SELECT number FROM dedicated_fax)
      AND duration > 2
      AND originating_number IN (SELECT number FROM voice_numbers)
    GROUP BY ...

... terms matched, and the number of times the search terms are mentioned in a document is used to give the indexed documents a score that determines their rank in relation to the query. The top n documents are used to establish the root set. A typical value for n is 200.

Identifying the Candidates

In the second phase, the root set is expanded to create the set of candidates. The candidate set ...

... by” link on every page.

Ranking Hubs and Authorities

The final phase is to divide the candidate pages into hubs and authorities and rank them according to their strength in those roles. This process also has the effect of grouping together pages that refer to the same meaning of a search term with multiple meanings, for instance, Madonna the rock star versus the Madonna and Child in art history or Jaguar ...

... honor of the mathematician who posed and solved this problem.

Figure 10.4 The Pregel River in Königsberg has two islands connected by a total of seven bridges.

Figure 10.5 This graph represents the layout of Königsberg. The edges are bridges and the nodes are the riverbanks and islands. (Nodes: A, B, C, D. Edges: AC1, AC2, AD, BC1, BC2, BD, CD.)

WHY DO THE DEGREES ...