Second, link analysis can apply the concepts generated by visualization to larger sets of customers. For instance, a churn reduction program might avoid targeting customers who have high inertia or be sure to target customers with high influence. This requires traversing the call graph to calculate the inertia or influence for all customers. Such derived characteristics can play an important role in marketing efforts.

Different marketing programs might suggest looking for other features in the call graph. For instance, perhaps the ability to place a conference call would be desirable, but who would be the best prospects? One idea would be to look for groups of customers that all call each other. Stated as a graph problem, this group is a fully connected subgraph. In the telephone industry, these subgraphs are called "communities of interest." A community of interest may represent a group of customers who would be interested in the ability to place conference calls.

Lessons Learned

Link analysis is an application of the mathematical field of graph theory. As a data mining technique, link analysis has several strengths:

■■ It capitalizes on relationships.
■■ It is useful for visualization.
■■ It creates derived characteristics that can be used for further mining.

Some data and data mining problems naturally involve links. As the two case studies about telephone data show, link analysis is very useful for telecommunications—a telephone call is a link between two people. Opportunities for link analysis are most obvious in fields where the links are obvious, such as telephony, transportation, and the World Wide Web. Link analysis is also appropriate in other areas where the connections do not have such a clear manifestation, such as physician referral patterns, retail sales data, and forensic analysis for crimes.

Links are a very natural way to visualize some types of data. Direct visualization of the links can be a big aid to knowledge discovery. Even when automated patterns are found, visualization of the links helps to better understand what is happening. Link analysis offers an alternative way of looking at data, different from the formats of relational databases and OLAP tools. Links may suggest important patterns in the data, but the significance of the patterns requires a person for interpretation.

Link analysis can lead to new and useful data attributes. Examples include calculating an authority score for a page on the World Wide Web and calculating the sphere of influence for a telephone user.

Although link analysis is very powerful when applicable, it is not appropriate for all types of problems. It is not a prediction tool or classification tool like a neural network that takes data in and produces an answer. Many types of data are simply not appropriate for link analysis. Its strongest use is probably in finding specific patterns, such as the types of outgoing calls, which can then be applied to data. These patterns can be turned into new features of the data, for use in conjunction with other directed data mining techniques.

Chapter 11
Automatic Cluster Detection

The data mining techniques described in this book are used to find meaningful patterns in data. These patterns are not always immediately forthcoming. Sometimes this is because there are no patterns to be found.
Other times, the problem is not the lack of patterns, but the excess. The data may contain so much complex structure that even the best data mining techniques are unable to coax out meaningful patterns. When mining such a database for the answer to some specific question, competing explanations tend to cancel each other out. As with radio reception, too many competing signals add up to noise. Clustering provides a way to learn about the structure of complex data, to break up the cacophony of competing signals into its components.

When human beings try to make sense of complex questions, our natural tendency is to break the subject into smaller pieces, each of which can be explained more simply. If someone were asked to describe the color of trees in the forest, the answer would probably make distinctions between deciduous trees and evergreens, and between winter, spring, summer, and fall. People know enough about woodland flora to predict that, of all the hundreds of variables associated with the forest, season and foliage type, rather than say age and height, are the best factors to use for forming clusters of trees that follow similar coloration rules.

Once the proper clusters have been defined, it is often possible to find simple patterns within each cluster. "In Winter, deciduous trees have no leaves so the trees tend to be brown" or "The leaves of deciduous trees change color in the autumn, typically to oranges, reds, and yellows." In many cases, a very noisy dataset is actually composed of a number of better-behaved clusters. The question is: how can these be found? That is where techniques for automatic cluster detection come in—to help see the forest without getting lost in the trees.

This chapter begins with two examples of the usefulness of clustering—one drawn from astronomy, another from clothing design. It then introduces the K-means clustering algorithm which, like the nearest neighbor techniques discussed in Chapter 8, depends on a geometric interpretation of data. The geometric ideas used in K-means bring up the more general topic of measures of similarity, association, and distance. These distance measures are quite sensitive to variations in how data is represented, so the next topic addressed is data preparation for clustering, with special attention being paid to scaling and weighting. K-means is not the only algorithm in common use for automatic cluster detection. This chapter contains brief discussions of several others: Gaussian mixture models, agglomerative clustering, and divisive clustering. (Another clustering technique, self-organizing maps, is covered in Chapter 7 because self-organizing maps are a form of neural network.) The chapter concludes with a case study in which automatic cluster detection is used to evaluate editorial zones for a major daily newspaper.

Searching for Islands of Simplicity

In Chapter 1, where data mining techniques are classified as directed or undirected, automatic cluster detection is described as a tool for undirected knowledge discovery. In the technical sense, that is true because the automatic cluster detection algorithms themselves are simply finding structure that exists in the data without regard to any particular target variable. Most data mining tasks start out with a preclassified training set, which is used to develop a model capable of scoring or classifying previously unseen records.
In clustering, there is no preclassified data and no distinction between independent and dependent variables. Instead, clustering algorithms search for groups of records—the clusters—composed of records similar to each other. The algorithms discover these similarities. It is up to the people running the analysis to determine whether similar records represent something of interest to the business—or something inexplicable and perhaps unimportant.

In a broader sense, however, clustering can be a directed activity because clusters are sought for some business purpose. In marketing, clusters formed for a business purpose are usually called "segments," and customer segmentation is a popular application of clustering.

Automatic cluster detection is a data mining technique that is rarely used in isolation because finding clusters is not often an end in itself. Once clusters have been detected, other methods must be applied in order to figure out what the clusters mean. When clustering is successful, the results can be dramatic: One famous early application of cluster detection led to our current understanding of stellar evolution.

Star Light, Star Bright

Early in the twentieth century, astronomers trying to understand the relationship between the luminosity (brightness) of stars and their temperatures made scatter plots like the one in Figure 11.1. The vertical scale measures luminosity in multiples of the brightness of our own sun. The horizontal scale measures surface temperature in degrees Kelvin (degrees centigrade above absolute zero, the theoretical coldest possible temperature).

Figure 11.1 The Hertzsprung-Russell diagram clusters stars by temperature and luminosity. [Scatter plot: surface temperature in degrees Kelvin (40,000 down to 2,500) on the horizontal axis, luminosity relative to the sun (log scale, 10^-4 to 10^6) on the vertical axis; the labeled clusters are the Main Sequence, Red Giants, and White Dwarfs.]

Two different astronomers, Ejnar Hertzsprung in Denmark and Henry Norris Russell in the United States, thought of doing this at about the same time. They both observed that in the resulting scatter plot, the stars fall into three clusters. This observation led to further work and the understanding that these three clusters represent stars in very different phases of the stellar life cycle. The relationship between luminosity and temperature is consistent within each cluster, but the relationship is different between the clusters because fundamentally different processes are generating the heat and light. The 80 percent of stars that fall on the main sequence are generating energy by converting hydrogen to helium through nuclear fusion. This is how all stars spend most of their active life. After some number of billions of years, the hydrogen is used up. Depending on its mass, the star then begins fusing helium or the fusion stops. In the latter case, the core of the star collapses, generating a great deal of heat in the process. At the same time, the outer layer of gasses expands away from the core, and a red giant is formed. Eventually, the outer layer of gasses is stripped away, and the remaining core begins to cool. The star is now a white dwarf.

A recent search on Google using the phrase "Hertzsprung-Russell Diagram" returned thousands of pages of links to current astronomical research based on cluster detection of this kind.
Even today, clusters based on the HR diagram are being used to hunt for brown dwarfs (starlike objects that lack sufficient mass to initiate nuclear fusion) and to understand pre–main sequence stellar evolution.

Fitting the Troops

The Hertzsprung-Russell diagram is a good introductory example of clustering because with only two variables, it is easy to spot the clusters visually (and, incidentally, it is a good example of the importance of good data visualizations). Even in three dimensions, picking out clusters by eye from a scatter plot cube is not too difficult. If all problems had so few dimensions, there would be no need for automatic cluster detection algorithms. As the number of dimensions (independent variables) increases, it becomes increasingly difficult to visualize clusters. Our intuition about how close things are to each other also quickly breaks down with more dimensions.

Saying that a problem has many dimensions is an invitation to analyze it geometrically. A dimension is each of the things that must be measured independently in order to describe something. In other words, if there are N variables, imagine a space in which the value of each variable represents a distance along the corresponding axis in an N-dimensional space. A single record containing a value for each of the N variables can be thought of as the vector that defines a particular point in that space. When there are two dimensions, this is easily plotted. The HR diagram was one such example. Figure 11.2 is another example that plots the height and weight of a group of teenagers as points on a graph. Notice the clustering of boys and girls.

Figure 11.2 Heights and weights of a group of teenagers. [Scatter plot: weight in pounds (100 to 200) on the horizontal axis, height in inches (60 to 80) on the vertical axis.]

The chart in Figure 11.2 begins to give a rough idea of people's shapes. But if the goal is to fit them for clothes, a few more measurements are needed! In the 1990s, the U.S. army commissioned a study on how to redesign the uniforms of female soldiers. The army's goal was to reduce the number of different uniform sizes that have to be kept in inventory, while still providing each soldier with well-fitting uniforms. As anyone who has ever shopped for women's clothing is aware, there is already a surfeit of classification systems (even sizes, odd sizes, plus sizes, junior, petite, and so on) for categorizing garments by size. None of these systems was designed with the needs of the U.S. military in mind. Susan Ashdown and Beatrix Paal, researchers at Cornell University, went back to the basics; they designed a new set of sizes based on the actual shapes of women in the army.¹

¹ Ashdown, Susan P. 1998. "An Investigation of the Structure of Sizing Systems: A Comparison of Three Multidimensional Optimized Sizing Systems Generated from Anthropometric Data," International Journal of Clothing Science and Technology, Vol. 10, No. 5, pp. 324-341.

Unlike the traditional clothing size systems, the one Ashdown and Paal came up with is not an ordered set of graduated sizes where all dimensions increase together. Instead, they came up with sizes that fit particular body types. Each body type corresponds to a cluster of records in a database of body measurements.
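To make the records-as-points idea concrete, here is a minimal sketch in Python. It is not from the book; the two records and their measurement values are invented purely for illustration.

```python
from math import sqrt

# Two hypothetical records from a body-measurement database, each a point
# in N-dimensional space (here N = 3):
# (height in inches, waist in inches, torso length in inches).
record_a = (64.0, 27.0, 24.5)
record_b = (70.0, 31.0, 27.0)

def euclidean_distance(x, y):
    """Straight-line distance between two points with the same number of dimensions."""
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance(record_a, record_b))  # how far apart the two records are
```

Nothing in this representation changes when the tuples grow from 3 measurements to more than 100; clusters are still just groups of points that lie close together in that space.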
One cluster might consist of short-legged, small-waisted, large-busted women with long torsos, average arms, broad shoulders, and skinny necks, while other clusters capture other constellations of measurements. The database contained more than 100 measurements for each of nearly 3,000 women. The clustering technique employed was the K-means algorithm, described in the next section. In the end, only a handful of the more than 100 measurements were needed to characterize the clusters. Finding this smaller number of variables was another benefit of the clustering process.

K-Means Clustering

The K-means algorithm is one of the most commonly used clustering algorithms. The "K" in its name refers to the fact that the algorithm looks for a fixed number of clusters, which are defined in terms of proximity of data points to each other. The version described here was first published by J. B. MacQueen in 1967. For ease of explaining, the technique is illustrated using two-dimensional diagrams. Bear in mind that in practice the algorithm is usually handling many more than two independent variables. This means that instead of points corresponding to two-element vectors (x1, x2), the points correspond to n-element vectors (x1, x2, ..., xn). The procedure itself is unchanged.

Three Steps of the K-Means Algorithm

In the first step, the algorithm randomly selects K data points to be the seeds. MacQueen's algorithm simply takes the first K records. In cases where the records have some meaningful order, it may be desirable to choose widely spaced records, or a random selection of records. Each of the seeds is an embryonic cluster with only one element. This example sets the number of clusters to 3.

The second step assigns each record to the closest seed. One way to do this is by finding the boundaries between the clusters, as shown geometrically in Figure 11.3. The boundaries between two clusters are the points that are equally close to each cluster. Recalling a lesson from high-school geometry makes this less difficult than it sounds: given any two points, A and B, all points that are equidistant from A and B fall along a line (called the perpendicular bisector) that is perpendicular to the one connecting A and B and halfway between them. In Figure 11.3, dashed lines connect the initial seeds; the resulting cluster boundaries shown with solid lines are at right angles to the dashed lines. Using these lines as guides, it is obvious which records are closest to which seeds. In three dimensions, these boundaries would be planes and in N dimensions they would be hyperplanes of dimension N – 1. Fortunately, computer algorithms easily handle these situations. Finding the actual boundaries between clusters is useful for showing the process geometrically. In practice, though, the algorithm usually measures the distance of each record to each seed and chooses the minimum distance for this step. For example, consider the record with the box drawn around it. On the basis of the initial seeds, this record is assigned to the cluster controlled by seed number 2 because it is closer to that seed than to either of the other two. At this point, every point has been assigned to exactly one of the three clusters centered around the original seeds.
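In code, the second step is just a nearest-seed lookup. The following is a minimal sketch in Python; it is not from the book, and the function name, the seed coordinates, and the example records are invented for illustration.

```python
from math import sqrt

def euclidean(a, b):
    """Straight-line distance between two equal-length tuples of numbers."""
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def assign_to_seeds(records, seeds):
    """Step 2: give each record the index of the seed it is closest to."""
    return [min(range(len(seeds)), key=lambda i: euclidean(r, seeds[i]))
            for r in records]

seeds = [(1.0, 1.0), (5.0, 1.0), (3.0, 6.0)]            # three invented seeds
records = [(1.2, 0.8), (4.7, 1.4), (2.9, 5.5), (2.0, 3.0)]
print(assign_to_seeds(records, seeds))                   # [0, 1, 2, 0] for these points
```

Measuring the distance from each record to each seed and taking the minimum, as here, gives the same assignments as drawing the perpendicular-bisector boundaries, but needs no geometric construction.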
The third step is to calculate the centroids of the clusters; these now do a better job of characterizing the clusters than the initial seeds. Finding the centroids is simply a matter of taking the average value of each dimension for all the records in the cluster. In Figure 11.4, the new centroids are marked with a cross. The arrows show the motion from the position of the original seeds to the new centroids of the clusters formed from those seeds.

Figure 11.3 The initial seeds determine the initial cluster boundaries. [Two-dimensional plot with axes X1 and X2 showing Seed 1, Seed 2, and Seed 3, the dashed lines connecting them, and the solid boundary lines between the clusters.]

[...]

Similarity and Distance

Once records in a database have been mapped to points in space, automatic cluster detection is really quite simple—a little geometry, some vector means, et voilà! The problem, of course, is that the databases encountered in marketing, sales, and customer support are not about points in space. They are about purchases, phone calls, airplane trips, car registrations, and a thousand other ...

Two points that differ by 2 in dimensions X and Y and by 1 in dimension Z are the same distance apart as two other points that differ by 1 in dimension X and by 2 in dimensions Y and Z. It doesn't matter what units X, Y, and Z are measured in, so long as they are the same. But what if X is measured in yards, Y is measured in centimeters, and Z is measured in nautical miles? A difference ...

... with sardines, cod, and tuna, while kittens cluster with cougars, lions, and tigers, even though in a database of body-part lengths, the sardine is closer to a kitten than it is to a catfish. The solution is to use a different geometric interpretation of the same data. Instead of thinking of X and Y as points in space and measuring the distance between them, think of them as vectors and measure the angle ...

To find the Euclidean distance between X and Y, first find the differences between the corresponding elements of X and Y (the distance along each axis) and square them. The distance is the square root of the sum of the squared differences.

DISTANCE METRICS
Any function that takes two points and produces a single number describing a relationship between them is a candidate measure of similarity, ...

... than two families on the same size plot, and we want that to be taken into consideration during clustering? That is where weighting comes in. The purpose of weighting is to encode the information that one variable is more (or less) important than others. A good place to start is by standardizing all variables so each has a mean of zero and a variance (and standard deviation) of one. That way, all fields ...

... choice for customer segmentation than the undirected clustering algorithms discussed in this chapter. If the purpose of the customer segmentation is to find customer segments that are loyal or profitable or likely to respond to some particular offer, it makes sense to use one of those variables (or a proxy) as the target for directed clustering. If, on the other hand, the point of the customer segmentation ...

... sorts of things. In another example, a long-distance company developed customer signatures based on call detail data in order to predict fraud and later found that the same variables were useful for distinguishing between business and residential users.

TIP Although the time and effort it takes to create a good customer signature can seem daunting, the effort is repaid over time ...
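Pulling the three K-means steps together with the Euclidean distance just described gives a compact iterative algorithm. The sketch below, in Python, is not from the book and is not MacQueen's exact formulation: the random seed selection, the stopping rule, the handling of empty clusters, and all names are choices made here for illustration.

```python
import random
from math import sqrt

def euclidean(a, b):
    """Square root of the sum of squared differences along each dimension."""
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def k_means(records, k, max_iterations=100):
    """Bare-bones K-means; returns the final list of k centroids."""
    # Step 1: pick k records as the initial seeds (here, chosen at random).
    centroids = random.sample(records, k)
    for _ in range(max_iterations):
        # Step 2: assign every record to the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for record in records:
            nearest = min(range(k), key=lambda i: euclidean(record, centroids[i]))
            clusters[nearest].append(record)
        # Step 3: recompute each centroid as the dimension-by-dimension mean
        # of the records assigned to it.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if not cluster:
                new_centroids.append(centroids[i])  # keep an empty cluster's old centroid
                continue
            dims = range(len(cluster[0]))
            new_centroids.append(tuple(sum(r[d] for r in cluster) / len(cluster) for d in dims))
        if new_centroids == centroids:  # nothing moved, so assignments are stable
            break
        centroids = new_centroids
    return centroids

# Invented two-dimensional example with two obvious groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.3, 7.9), (7.8, 8.2)]
print(k_means(points, k=2))  # typically one centroid near (1, 1) and one near (8, 8)
```

Steps 2 and 3 repeat until the centroids stop moving, which is when the cluster assignments no longer change.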
The notions of scaling and weighting each play important roles in clustering. Although similar, and often confused with each other, the two notions are not the same. Scaling adjusts the values of variables to take into account the fact that different variables are measured in different units or over different ranges. For instance, household income is measured in tens of thousands of dollars and number of children ...

... as old as a 25-year-old or that a 10-pound bag of sugar is twice as heavy as a 5-pound one. Age, weight, length, customer tenure, and volume are examples of true measures. Geometric distance metrics are well-defined for interval variables and true measures. In order to use categorical variables and rankings, it is necessary to transform them into interval variables. Unfortunately, these transformations may ...

... the mean of all the values it takes on. This is often called "indexing a variable."

■■ Subtract the mean value from each variable and then divide it by the standard deviation. This is often called standardization or "converting to z-scores." A z-score tells you how many standard deviations away from the mean a value is.

Normalizing a single variable simply changes its range. A closely related concept is ...
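As a concrete illustration of converting a variable to z-scores, as described in the bullet above, here is a minimal sketch in Python; it is not from the book, and the function name and example values are invented.

```python
from math import sqrt

def z_scores(values):
    """Subtract the mean from each value and divide by the standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std_dev = sqrt(variance)
    if std_dev == 0:
        return [0.0 for _ in values]  # a constant variable carries no distance information
    return [(v - mean) / std_dev for v in values]

# Invented example: two variables measured on very different scales.
incomes = [30000.0, 45000.0, 60000.0, 120000.0]   # tens of thousands of dollars
children = [0.0, 2.0, 3.0, 1.0]                   # small counts
print(z_scores(incomes))   # both variables now have mean 0 and standard deviation 1,
print(z_scores(children))  # so neither dominates a Euclidean distance calculation
```

Weighting, as described above, could then be layered on top by multiplying a standardized variable by a factor that reflects how important it should be in the distance calculation.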