John wiley sons data mining techniques for marketing sales_12 doc


470643 c10.qxd 3/8/04 11:16 AM Page 346

Chapter 10

Second, link analysis can apply the concepts generated by visualization to larger sets of customers. For instance, a churn reduction program might avoid targeting customers who have high inertia or be sure to target customers with high influence. This requires traversing the call graph to calculate the inertia or influence for all customers. Such derived characteristics can play an important role in marketing efforts.

Different marketing programs might suggest looking for other features in the call graph. For instance, perhaps the ability to place a conference call would be desirable, but who would be the best prospects? One idea would be to look for groups of customers that all call each other. Stated as a graph problem, this group is a fully connected subgraph. In the telephone industry, these subgraphs are called “communities of interest.” A community of interest may represent a group of customers who would be interested in the ability to place conference calls.

Lessons Learned

Link analysis is an application of the mathematical field of graph theory. As a data mining technique, link analysis has several strengths:

■ It capitalizes on relationships.
■ It is useful for visualization.
■ It creates derived characteristics that can be used for further mining.

Some data and data mining problems naturally involve links. As the two case studies about telephone data show, link analysis is very useful for telecommunications: a telephone call is a link between two people. Opportunities for link analysis are most obvious in fields where the links are obvious, such as telephony, transportation, and the World Wide Web. Link analysis is also appropriate in other areas where the connections do not have such a clear manifestation, such as physician referral patterns, retail sales data, and forensic analysis for crimes.

Links are a very natural way to visualize some types of data.
Direct visualization of the links can be a big aid to knowledge discovery. Even when automated patterns are found, visualization of the links helps to better understand what is happening. Link analysis offers an alternative way of looking at data, different from the formats of relational databases and OLAP tools. Links may suggest important patterns in the data, but the significance of the patterns requires a person for interpretation.

Link analysis can lead to new and useful data attributes. Examples include calculating an authority score for a page on the World Wide Web and calculating the sphere of influence for a telephone user.

Although link analysis is very powerful when applicable, it is not appropriate for all types of problems. It is not a prediction tool or classification tool like a neural network that takes data in and produces an answer. Many types of data are simply not appropriate for link analysis. Its strongest use is probably in finding specific patterns, such as the types of outgoing calls, which can then be applied to data. These patterns can be turned into new features of the data, for use in conjunction with other directed data mining techniques.

CHAPTER 11

Automatic Cluster Detection

The data mining techniques described in this book are used to find meaningful patterns in data. These patterns are not always immediately forthcoming. Sometimes this is because there are no patterns to be found. Other times, the problem is not the lack of patterns, but the excess. The data may contain so much complex structure that even the best data mining techniques are unable to coax out meaningful patterns. When mining such a database for the answer to some specific question, competing explanations tend to cancel each other out. As with radio reception, too many competing signals add up to noise.
Clustering provides a way to learn about the structure of complex data, to break up the cacophony of competing signals into its components.

When human beings try to make sense of complex questions, our natural tendency is to break the subject into smaller pieces, each of which can be explained more simply. If someone were asked to describe the color of trees in the forest, the answer would probably make distinctions between deciduous trees and evergreens, and between winter, spring, summer, and fall. People know enough about woodland flora to predict that, of all the hundreds of variables associated with the forest, season and foliage type, rather than, say, age and height, are the best factors to use for forming clusters of trees that follow similar coloration rules.

Once the proper clusters have been defined, it is often possible to find simple patterns within each cluster. “In winter, deciduous trees have no leaves so the trees tend to be brown” or “The leaves of deciduous trees change color in the autumn, typically to oranges, reds, and yellows.” In many cases, a very noisy dataset is actually composed of a number of better-behaved clusters. The question is: how can these be found? That is where techniques for automatic cluster detection come in, to help see the forest without getting lost in the trees.

This chapter begins with two examples of the usefulness of clustering, one drawn from astronomy and another from clothing design. It then introduces the K-Means clustering algorithm which, like the nearest neighbor techniques discussed in Chapter 8, depends on a geometric interpretation of data. The geometric ideas used in K-Means bring up the more general topic of measures of similarity, association, and distance.
These distance measures are quite sensitive to variations in how data is represented, so the next topic addressed is data preparation for clustering, with special attention being paid to scaling and weighting. K-Means is not the only algorithm in common use for automatic cluster detection. This chapter contains brief discussions of several others: Gaussian mixture models, agglomerative clustering, and divisive clustering. (Another clustering technique, self-organizing maps, is covered in Chapter 7 because self-organizing maps are a form of neural network.) The chapter concludes with a case study in which automatic cluster detection is used to evaluate editorial zones for a major daily newspaper.

Searching for Islands of Simplicity

In Chapter 1, where data mining techniques are classified as directed or undirected, automatic cluster detection is described as a tool for undirected knowledge discovery. In the technical sense, that is true because the automatic cluster detection algorithms themselves are simply finding structure that exists in the data without regard to any particular target variable. Most data mining tasks start out with a preclassified training set, which is used to develop a model capable of scoring or classifying previously unseen records. In clustering, there is no preclassified data and no distinction between independent and dependent variables. Instead, clustering algorithms search for groups of records (the clusters) composed of records similar to each other. The algorithms discover these similarities. It is up to the people running the analysis to determine whether similar records represent something of interest to the business, or something inexplicable and perhaps unimportant.

In a broader sense, however, clustering can be a directed activity because clusters are sought for some business purpose.
In marketing, clusters formed for a business purpose are usually called “segments,” and customer segmentation is a popular application of clustering.

Automatic cluster detection is a data mining technique that is rarely used in isolation because finding clusters is not often an end in itself. Once clusters have been detected, other methods must be applied in order to figure out what the clusters mean. When clustering is successful, the results can be dramatic: one famous early application of cluster detection led to our current understanding of stellar evolution.

Star Light, Star Bright

Early in the twentieth century, astronomers trying to understand the relationship between the luminosity (brightness) of stars and their temperatures made scatter plots like the one in Figure 11.1. The vertical scale measures luminosity in multiples of the brightness of our own sun. The horizontal scale measures surface temperature in degrees Kelvin (degrees centigrade above absolute 0, the theoretical coldest possible temperature).

[Figure 11.1: The Hertzsprung-Russell diagram clusters stars by temperature and luminosity. Vertical axis: Luminosity (Sun = 1), from 10^-4 to 10^6. Horizontal axis: Temperature (Degrees Kelvin), from 40,000 to 2,500. Labeled regions: Red Giants, Main Sequence, White Dwarfs.]

Two different astronomers, Ejnar Hertzsprung in Denmark and Henry Norris Russell in the United States, thought of doing this at about the same time. They both observed that in the resulting scatter plot, the stars fall into three clusters. This observation led to further work and the understanding that these three clusters represent stars in very different phases of the stellar life cycle.
The relationship between luminosity and temperature is consistent within each cluster, but the relationship is different between the clusters because fundamentally different processes are generating the heat and light. The 80 percent of stars that fall on the main sequence are generating energy by converting hydrogen to helium through nuclear fusion. This is how all stars spend most of their active life. After some number of billions of years, the hydrogen is used up. Depending on its mass, the star then begins fusing helium or the fusion stops. In the latter case, the core of the star collapses, generating a great deal of heat in the process. At the same time, the outer layer of gases expands away from the core, and a red giant is formed. Eventually, the outer layer of gases is stripped away, and the remaining core begins to cool. The star is now a white dwarf.

A recent search on Google using the phrase “Hertzsprung-Russell Diagram” returned thousands of pages of links to current astronomical research based on cluster detection of this kind. Even today, clusters based on the HR diagram are being used to hunt for brown dwarfs (starlike objects that lack sufficient mass to initiate nuclear fusion) and to understand pre–main sequence stellar evolution.

Fitting the Troops

The Hertzsprung-Russell diagram is a good introductory example of clustering because with only two variables, it is easy to spot the clusters visually (and, incidentally, it is a good example of the importance of good data visualizations). Even in three dimensions, picking out clusters by eye from a scatter plot cube is not too difficult. If all problems had so few dimensions, there would be no need for automatic cluster detection algorithms. As the number of dimensions (independent variables) increases, it becomes increasingly difficult to visualize clusters. Our intuition about how close things are to each other also quickly breaks down with more dimensions.
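
The claim that intuition about closeness breaks down as dimensions are added can be checked with a small experiment. The sketch below is an illustration only, not something taken from the chapter: it assumes uniformly random points (real data is rarely uniform) and compares the nearest and farthest distances from one reference point. The function name `relative_contrast` and the sample sizes are hypothetical choices.

```python
import math
import random


def relative_contrast(n_dims, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest distance from one random
    reference point to n_points other uniformly random points."""
    rng = random.Random(seed)
    reference = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(reference, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


if __name__ == "__main__":
    # Print the contrast for a few dimensionalities.
    for dims in (2, 10, 100, 1000):
        print(dims, round(relative_contrast(dims), 3))
```

Under these assumptions the contrast should shrink as dimensions are added: in many dimensions the nearest and farthest points end up at nearly the same distance, which is one way to see why spotting clusters by eye stops working.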
Saying that a problem has many dimensions is an invitation to analyze it geometrically. A dimension is each of the things that must be measured independently in order to describe something. In other words, if there are N variables, imagine a space in which the value of each variable represents a distance along the corresponding axis in an N-dimensional space. A single record containing a value for each of the N variables can be thought of as the vector that defines a particular point in that space. When there are two dimensions, this is easily plotted. The HR diagram was one such example. Figure 11.2 is another example that plots the height and weight of a group of teenagers as points on a graph. Notice the clustering of boys and girls.

The chart in Figure 11.2 begins to give a rough idea of people’s shapes. But if the goal is to fit them for clothes, a few more measurements are needed!

[Figure 11.2: Heights and weights of a group of teenagers. Axes: Height (Inches), 60 to 80; Weight (Pounds), 100 to 200.]

In the 1990s, the U.S. Army commissioned a study on how to redesign the uniforms of female soldiers. The Army’s goal was to reduce the number of different uniform sizes that have to be kept in inventory, while still providing each soldier with well-fitting uniforms. As anyone who has ever shopped for women’s clothing is aware, there is already a surfeit of classification systems (even sizes, odd sizes, plus sizes, junior, petite, and so on) for categorizing garments by size. None of these systems was designed with the needs of the U.S. military in mind. Susan Ashdown and Beatrix Paal, researchers at Cornell University, went back to the basics; they designed a new set of sizes based on the actual shapes of women in the Army.[1]

[1] Ashdown, Susan P. 1998. “An Investigation of the Structure of Sizing Systems: A Comparison of Three Multidimensional Optimized Sizing Systems Generated from Anthropometric Data,” International Journal of Clothing Science and Technology. Vol. 10, #5, pp. 324-341.

Unlike the traditional clothing size systems, the one Ashdown and Paal came up with is not an ordered set of graduated sizes where all dimensions increase together. Instead, they came up with sizes that fit particular body types. Each body type corresponds to a cluster of records in a database of body measurements. One cluster might consist of short-legged, small-waisted, large-busted women with long torsos, average arms, broad shoulders, and skinny necks while other clusters capture other constellations of measurements.

The database contained more than 100 measurements for each of nearly 3,000 women. The clustering technique employed was the K-means algorithm, described in the next section. In the end, only a handful of the more than 100 measurements were needed to characterize the clusters. Finding this smaller number of variables was another benefit of the clustering process.

K-Means Clustering

The K-means algorithm is one of the most commonly used clustering algorithms. The “K” in its name refers to the fact that the algorithm looks for a fixed number of clusters which are defined in terms of proximity of data points to each other. The version described here was first published by J. B. MacQueen in 1967. For ease of explanation, the technique is illustrated using two-dimensional diagrams. Bear in mind that in practice the algorithm is usually handling many more than two independent variables. This means that instead of points corresponding to two-element vectors (x1, x2), the points correspond to n-element vectors (x1, x2, ..., xn). The procedure itself is unchanged.
Three Steps of the K-Means Algorithm

In the first step, the algorithm randomly selects K data points to be the seeds. MacQueen’s algorithm simply takes the first K records. In cases where the records have some meaningful order, it may be desirable to choose widely spaced records, or a random selection of records. Each of the seeds is an embryonic cluster with only one element. This example sets the number of clusters to 3.

The second step assigns each record to the closest seed. One way to do this is by finding the boundaries between the clusters, as shown geometrically in Figure 11.3. The boundaries between two clusters are the points that are equally close to each cluster. Recalling a lesson from high-school geometry makes this less difficult than it sounds: given any two points, A and B, all points that are equidistant from A and B fall along a line (called the perpendicular bisector) that is perpendicular to the one connecting A and B and halfway between them. In Figure 11.3, dashed lines connect the initial seeds; the resulting cluster boundaries shown with solid lines are at right angles to the dashed lines. Using these lines as guides, it is obvious which records are closest to which seeds. In three dimensions, these boundaries would be planes and in N dimensions they would be hyperplanes of dimension N – 1. Fortunately, computer algorithms easily handle these situations. Finding the actual boundaries between clusters is useful for showing the process geometrically. In practice, though, the algorithm usually measures the distance of each record to each seed and chooses the minimum distance for this step.

For example, consider the record with the box drawn around it. On the basis of the initial seeds, this record is assigned to the cluster controlled by seed number 2 because it is closer to that seed than to either of the other two.
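
The first two steps can be sketched in a few lines. This is a minimal illustration, not MacQueen's published procedure in full: the function names `choose_seeds` and `assign_to_nearest` and the sample records are hypothetical, and Euclidean distance is assumed as the measure of closeness.

```python
import math


def choose_seeds(records, k):
    """Step 1: take the first k records as seeds (the simple choice
    the text attributes to MacQueen's version)."""
    return [tuple(r) for r in records[:k]]


def assign_to_nearest(records, seeds):
    """Step 2: for each record, return the index of the closest seed,
    measured by Euclidean distance."""
    assignments = []
    for record in records:
        distances = [math.dist(record, seed) for seed in seeds]
        assignments.append(distances.index(min(distances)))
    return assignments


# Hypothetical two-dimensional records (x1, x2).
records = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0),
           (1.5, 2.0), (5.5, 4.0), (8.0, 2.0), (2.0, 0.5)]
seeds = choose_seeds(records, 3)
assignments = assign_to_nearest(records, seeds)
print(assignments)  # [0, 1, 2, 0, 1, 2, 0]
```

Measuring every record against every seed gives the same assignment as drawing the perpendicular bisectors, and the code is identical for records with any number of dimensions.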
At this point, every point has been assigned to exactly one of the three clusters centered around the original seeds. The third step is to calculate the centroids of the clusters; these now do a better job of characterizing the clusters than the initial seeds. Finding the centroids is simply a matter of taking the average value of each dimension for all the records in the cluster. In Figure 11.4, the new centroids are marked with a cross. The arrows show the motion from the position of the original seeds to the new centroids of the clusters formed from those seeds.

[Figure 11.3: The initial seeds determine the initial cluster boundaries. Axes: X1 and X2. Labeled points: Seed 1, Seed 2, Seed 3.]

[...]

... editorial zones directly through clustering is that there were business reasons for wanting about a dozen editorial zones, but no guarantee that a dozen good clusters would be found. This raises the general issue of how to determine the right number of clusters for a dataset. The data mining tool used for this clustering effort (MineSet, developed by SGI, and now available from Purple Insight) provides ...

... were useful for distinguishing between business and residential users.

TIP: Although the time and effort it takes to create a good customer signature can seem daunting, the effort is repaid over time because the same attributes often turn out to be predictive for many different target variables. The oft-quoted rule of thumb that 80 percent of the time spent on a data mining project goes into data preparation ...

... variables. The first step in turning this data into a town signature was to aggregate everything to the town level. For example, the subscriber data was aggregated to produce the total number of subscribers and median subscriber household income for each town. The next step was to transform counts into percentages. Most of the demographic information was in the form of counts. Even things like income, ...
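
The two preparation steps named in the excerpt above, aggregating to the town level and turning counts into percentages, might look like the following sketch. The subscriber rows, field names, town names, and population figures are all invented for illustration; none of them come from the case study.

```python
from collections import defaultdict
from statistics import median

# Hypothetical subscriber rows: (town, household_income).
subscribers = [
    ("Town A", 52000), ("Town A", 61000), ("Town A", 48000),
    ("Town B", 75000), ("Town B", 83000),
]

# Hypothetical census-style counts per town.
census_counts = {
    "Town A": {"population": 1000, "age_under_18": 250, "age_65_plus": 150},
    "Town B": {"population": 400, "age_under_18": 60, "age_65_plus": 100},
}


def build_town_signatures(subscribers, census_counts):
    """Aggregate subscriber rows per town, then express each census
    count as a percentage of that town's population."""
    incomes = defaultdict(list)
    for town, income in subscribers:
        incomes[town].append(income)

    signatures = {}
    for town, counts in census_counts.items():
        population = counts["population"]
        town_incomes = incomes[town]
        signature = {
            "subscribers": len(town_incomes),
            "median_subscriber_income": median(town_incomes) if town_incomes else None,
        }
        for name, count in counts.items():
            if name != "population":
                signature["pct_" + name] = 100.0 * count / population
        signatures[town] = signature
    return signatures


signatures = build_town_signatures(subscribers, census_counts)
for town, signature in signatures.items():
    print(town, signature)
```

One plausible motivation for the percentage step is that raw counts mostly reflect how big a town is, whereas percentages let towns of different sizes be compared on their makeup.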
These and other data transformation and preparation issues are discussed extensively in Chapter 17.

Formal Measures of Similarity

There are dozens if not hundreds of published techniques for measuring the similarity of two records. Some have been developed for specialized applications such as comparing passages of text. Others are designed especially for use with certain types of data such as binary variables ...

... well-defined for interval variables and true measures. In order to use categorical variables and rankings, it is necessary to transform them into interval variables. Unfortunately, these transformations may add spurious information. If ice cream flavors are assigned arbitrary numbers 1 through 28, it will appear that flavors 5 and 6 are closely related while flavors 1 and 28 are far apart. These and other data ...

... on a data mining project goes into data preparation becomes less true when the data preparation effort can be amortized over several predictive modeling efforts.

The Data

The town signatures were derived from several sources, with most of the variables coming from town-level U.S. Census data from 1990 and 2001. The census data provides counts of the number of residents by age, race, ethnic group, occupation, ...

... each Gaussian has for each data point (see Figure 11.8). Each Gaussian has strong responsibility for points that are close to its mean and weak responsibility for points that are distant. The responsibilities are used as weights in the next step. In the maximization step, a new centroid is calculated for each cluster taking into account the newly calculated responsibilities. The centroid for a given Gaussian ...
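
The responsibility and maximization steps described in the excerpt above can be sketched for one-dimensional data. This is a simplified illustration under stated assumptions: every Gaussian is given the same fixed standard deviation and equal weight, and only the means are updated, whereas a full Gaussian mixture model would also re-estimate the spreads and mixing weights. The function names and data points are hypothetical.

```python
import math


def gaussian_density(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def expectation_step(points, means, std=1.0):
    """For each point, compute the responsibility of each Gaussian:
    its density at the point divided by the total over all Gaussians."""
    responsibilities = []
    for x in points:
        densities = [gaussian_density(x, m, std) for m in means]
        total = sum(densities)
        responsibilities.append([d / total for d in densities])
    return responsibilities


def maximization_step(points, responsibilities, n_gaussians):
    """Recompute each mean as the responsibility-weighted average of all points."""
    new_means = []
    for j in range(n_gaussians):
        weight_sum = sum(r[j] for r in responsibilities)
        weighted_sum = sum(r[j] * x for r, x in zip(responsibilities, points))
        new_means.append(weighted_sum / weight_sum)
    return new_means


points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
means = [0.0, 6.0]  # deliberately poor starting guesses
for _ in range(10):
    responsibilities = expectation_step(points, means)
    means = maximization_step(points, responsibilities, len(means))
print([round(m, 2) for m in means])
```

With these made-up points the two means should settle close to 1 and 5. Unlike K-means, every point contributes to every centroid, just with a weight that falls off quickly with distance.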
... Globe brought to us at Data Miners.

Creating Town Signatures

Before deciding which towns belonged together, there needed to be a way of describing the towns: a town signature with a column for every feature that might be useful for characterizing a town and comparing it with its neighbors. As it happened, Data Miners had worked on an earlier project to find towns with good prospects for future circulation ...

... other applications, it may be desirable to filter outliers from the data; more often, the solution is to massage the data values. Later in this chapter there is a section on data preparation for clustering which describes how to work with variables to make it easier to find meaningful clusters.

Similarity and Distance

Once records in a database have been mapped to points in space, automatic cluster detection ...

... found. When screening for a very rare defect, there may not be enough examples to train a directed data mining model to detect it. One example is testing electric motors at the factory that makes them. Cluster detection methods can be used on a sample containing only good motors to determine the shape and size of the “normal” cluster. When a motor comes along that falls outside the cluster for any reason, it ...
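
The motor-testing idea described above can be sketched as follows. This is one possible reading rather than the method used in the book: the "normal" cluster is summarized by its centroid and by the largest distance of any known-good record from that centroid, and a new record is flagged when it lies farther out than that radius. The measurement names, values, and helper functions are hypothetical.

```python
import math


def centroid(records):
    """Average each dimension over all records."""
    n = len(records)
    return tuple(sum(values) / n for values in zip(*records))


def fit_normal_cluster(good_records):
    """Summarize the 'normal' cluster by its centroid and its radius
    (the largest distance from the centroid to any good record)."""
    center = centroid(good_records)
    radius = max(math.dist(center, r) for r in good_records)
    return center, radius


def is_outside(record, center, radius):
    """True when the record lies farther from the centroid than any good record did."""
    return math.dist(center, record) > radius


# Hypothetical, already-scaled measurements for good motors: (vibration, temperature).
good_motors = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.0, 1.2), (1.0, 0.8)]
center, radius = fit_normal_cluster(good_motors)

print(is_outside((1.1, 1.0), center, radius))  # a motor near the centroid
print(is_outside((3.0, 2.5), center, radius))  # a motor far from the centroid
```

A single radius treats the cluster as a sphere, which is a strong simplification; it also assumes the measurements have been scaled so that no one variable dominates the distance.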

Date posted: 21/06/2014, 04:20
