and mining techniques. Using the figure, try to understand the connections. Please study the following statements:

• Data mining algorithms are part of data mining techniques.
• Data mining techniques are used to carry out data mining functions. While performing specific data mining functions, you are applying data mining processes.
• A specific data mining function is generally suitable to a given application area.
• Each application area is a major area in business where data mining is actively used now.

We will devote the rest of this section to discussing the highlights of the major functions, the processes used to carry out the functions, and the data mining techniques themselves.

Data mining covers a broad range of techniques. This is not a textbook on data mining and a detailed discussion of the data mining algorithms is not within its scope. There are a number of well-written books in the field and you may refer to them to pursue your interest. Let us explore the basics here. We will select six of the major techniques for our discussion. Our intention is to understand these techniques broadly without getting down to technical details. The main goal is for you to get an overall appreciation of data mining techniques.

Figure 17-9 Data mining functions and application areas (mining techniques, mining processes, examples of mining functions, and application areas such as fraud detection, risk assessment, market analysis, and customer relationship marketing).

Cluster Detection

Clustering means forming groups. Take the very ordinary example of how you do your laundry. You group the clothes into whites, dark-colored clothes, light-colored clothes, permanent press, and the ones to be dry-cleaned. You have five distinct clusters. Each cluster has a meaning and you can use the meaning to get that cluster cleaned properly. The clustering helps you take specific and proper action for the individual pieces that make up the cluster.

Now think of a specialty store owner in a resort community who wants to cater to the neighborhood by stocking the right type of products. If he has data about the age group and income level of each of the people who frequent the store, using these two variables, the store owner can probably put the customers into four clusters. These clusters may be formed as follows: wealthy retirees staying in resorts, middle-aged weekend golfers, wealthy young people with club memberships, and low-income clients who happen to stay in the community. The information about the clusters helps the store owner in his marketing.

Clustering or cluster detection is one of the earliest data mining techniques. This technique is designated as undirected knowledge discovery or unsupervised learning. What do we mean by this statement? In the cluster detection technique, you do not search preclassified data. No distinction is made between independent and dependent variables. For example, in the case of the store's customers, there are two variables: age group and income level.
Both variables participate equally in the functioning of the data mining algorithm.

The cluster detection algorithm searches for groups or clusters of data elements that are similar to one another. What is the purpose of this? You expect similar customers or similar products to behave in the same way. Then you can take a cluster and do something useful with it. Again, in the example of the specialty store, the store owner can take the members of the cluster of wealthy retirees and target products specially interesting to them.

Notice one important aspect of clustering. When the mining algorithm produces a cluster, you must understand what that cluster means exactly. Only then will you be able to do something useful with that cluster. The store owner has to understand that one of the clusters represents wealthy retirees residing in resorts. Only then can the store owner do something useful with that cluster. It is not always easy to discern the meaning of every cluster the data mining algorithm forms. A bank may get as many as twenty clusters but be able to interpret the meanings of only two. But the return for the bank from the use of just these two clusters may be large enough that it can simply ignore the other eighteen clusters.

If there are only two or three variables or dimensions, it is fairly easy to spot the clusters, even when dealing with many records. But if you are dealing with 500 variables from 100,000 records, you need a special tool. How does the data mining tool perform the clustering function? Without getting bogged down in too much technical detail, let us study the process. First, some basics. If you have two variables, then points in a two-dimensional graph represent the values of sets of these two variables. Please refer to Figure 17-10, which shows the distribution of these points.

Figure 17-10 Clusters with two variables (number of years as customer plotted against total value to the enterprise).

Let us consider an example. Suppose you want the data mining algorithm to form clusters of your customers, but you want the algorithm to use 50 different variables for each customer, not just two. Now we are discussing a 50-dimensional space. Imagine each customer record with different values for the 50 dimensions. Each record is then a vector defining a "point" in the 50-dimensional space.

Let us say you want to market to the customers and you are prepared to run marketing campaigns for 15 different groups. So you set the number of clusters as 15. This number is K in the K-means clustering algorithm, a very effective one for cluster detection. Fifteen initial records (called "seeds") are chosen as the first set of centroids based on best guesses. One seed represents one set of values for the 50 variables chosen from the customer record. In the next step, the algorithm assigns each customer record in the database to a cluster based on the seed to which it is closest. Closeness is based on the nearness of the values of the set of 50 variables in a record to the values in the seed record. The first set of 15 clusters is now formed. Then the algorithm calculates the centroid or mean for each of the first set of 15 clusters. The values of the 50 variables in each centroid are taken to represent that cluster. The next iteration then starts. Each customer record is rematched with the new set of centroids and cluster boundaries are redrawn. After a few iterations the final clusters emerge. Now please refer to Figure 17-11 illustrating how centroids are determined and cluster boundaries redrawn.

Figure 17-11 Centroids and cluster boundaries (initial cluster boundaries based on initial seeds, centroids of new clusters calculated, and cluster boundaries redrawn at each iteration).

How does the algorithm redraw the cluster boundaries? What factors determine that one customer record is near one centroid and not the other? Each implementation of the cluster detection algorithm adopts a method of comparing the values of the variables in individual records with those in the centroids. The algorithm uses these comparisons to calculate the distances of individual customer records from the centroids. After calculating the distances, the algorithm redraws the cluster boundaries.
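To make the iterative procedure concrete, here is a minimal K-means sketch in Python. The customer records, the choice of K, and the use of simple Euclidean distance as the closeness measure are assumptions made for illustration; each implementation chooses its own distance measure and seeding strategy.

```python
import random

def euclidean(a, b):
    # distance between two records across all of their variables
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def k_means(records, k, iterations=10):
    # pick k initial seeds as the first set of centroids (best-guess starting points)
    centroids = random.sample(records, k)
    for _ in range(iterations):
        # assign every record to the cluster whose centroid is closest
        clusters = [[] for _ in range(k)]
        for rec in records:
            nearest = min(range(k), key=lambda i: euclidean(rec, centroids[i]))
            clusters[nearest].append(rec)
        # recalculate each centroid as the mean of its cluster's records
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, clusters

# hypothetical customer records: (age, income in thousands)
customers = [(68, 90), (70, 110), (45, 60), (50, 75), (30, 95), (28, 120), (25, 20), (33, 18)]
centroids, clusters = k_means(customers, k=4)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

With 50 variables instead of two, nothing changes except the length of each record vector; the same loop of assignment and centroid recalculation produces the final clusters.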
Decision Trees

This technique applies to classification and prediction. The major attraction of decision trees is their simplicity. By following the tree, you can decipher the rules and understand why a record is classified in a certain way. Decision trees represent rules. You can use these rules to retrieve records falling into a certain category. Please examine Figure 17-12 showing a decision tree representing the profiles of men and women buying a notebook computer.

Figure 17-12 Decision tree for notebook computer buyers (splits on portability, speed, storage, cost, and keyboard comfort lead to leaf nodes classifying buyers as men or women).

In some data mining processes, you really do not care how the algorithm selected a certain record. For example, when you are selecting prospects to be targeted in a marketing campaign, you do not need the reasons for targeting them. You only need the ability to predict which members are likely to respond to the mailing. But in some other cases, the reasons for the prediction are important. If your company is a mortgage company and wants to evaluate an application, you need to know why an application must be rejected. Your company must be able to protect itself from any lawsuits of discrimination. Wherever the reasons are necessary and you must be able to trace the decision paths, decision trees are suitable.

As you have seen from Figure 17-12, a decision tree represents a series of questions. Each question determines what follow-up question is best to be asked next. Good questions produce a short series. Trees are drawn with the root at the top and the leaves at the bottom, an unnatural convention. The question at the root must be the one that best differentiates among the target classes. A database record enters the tree at the root node. The record works its way down until it reaches a leaf. The leaf node determines the classification of the record.

How can you measure the effectiveness of a tree? In the example of the profiles of buyers of notebook computers, you can pass through the tree the records whose classifications are already known. Then you can calculate the percentage of correctness for the known records. A tree showing a high level of correctness is more effective. Also, you must pay attention to the branches. Some paths are better than others because the rules are better. By pruning the weaker branches, you can enhance the predictive effectiveness of the whole tree.

How do the decision tree algorithms build the trees? First, the algorithm attempts to find the test that will split the records in the best possible manner among the wanted classifications. At each lower-level node below the root, whatever rule works best to split the subsets is applied. This process of finding each additional level of the tree continues. The tree is allowed to grow until you cannot find better ways to split the input records.
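The splitting process can be sketched with a toy tree builder. The Gini impurity measure and the sample buyer records below are illustrative assumptions, not the specific splitting criterion of any particular tool.

```python
def gini(rows):
    # impurity of a set of (features, label) rows: 0 means the set is pure
    total = len(rows)
    if total == 0:
        return 0.0
    counts = {}
    for _, label in rows:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

def best_split(rows):
    # try every (attribute, value) test and keep the one with the lowest weighted impurity
    best = None
    for attr in rows[0][0]:
        for value in {features[attr] for features, _ in rows}:
            left = [r for r in rows if r[0][attr] == value]
            right = [r for r in rows if r[0][attr] != value]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, attr, value, left, right)
    return best

def build_tree(rows):
    split = best_split(rows)
    if split is None or gini(rows) == 0.0:
        # leaf node: classify by the majority label among the remaining records
        labels = [label for _, label in rows]
        return max(set(labels), key=labels.count)
    _, attr, value, left, right = split
    return {"test": (attr, value), "yes": build_tree(left), "no": build_tree(right)}

# hypothetical notebook-buyer records, loosely in the spirit of Figure 17-12
data = [
    ({"portability": "light", "cost": "<$2,500"}, "W"),
    ({"portability": "light", "cost": "more"}, "M"),
    ({"portability": "medium", "cost": "<$2,500"}, "M"),
    ({"portability": "medium", "cost": "more"}, "M"),
]
print(build_tree(data))
```

The nested dictionary that comes back is the tree itself: each test is a question, and every path from the root to a leaf reads as a rule you can show to a loan officer or an auditor.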
Memory-Based Reasoning

Would you rather go to an experienced doctor or to a novice? Of course, the answer is obvious. Why? Because the experienced doctor treats you and cures you based on his or her experience. The doctor knows what worked in the past in several cases when the symptoms were similar to yours. We are all good at making decisions on the basis of our experiences. We depend on the similarities of the current situation to what we know from past experience. How do we use the experience to solve the current problem? First, we identify similar instances in the past, then we use the past instances and apply the information about those instances to the present. The same principles apply to the memory-based reasoning (MBR) algorithm.

MBR uses known instances of a model to predict unknown instances. This data mining technique maintains a dataset of known records. The algorithm knows the characteristics of the records in this training dataset. When a new record arrives for evaluation, the algorithm finds neighbors similar to the new record, then uses the characteristics of the neighbors for prediction and classification.

When a new record arrives at the data mining tool, first the tool calculates the "distance" between this record and the records in the training dataset. The distance function of the data mining tool does the calculation. The results determine which data records in the training dataset qualify to be considered as neighbors to the incoming data record. Next, the algorithm uses a combination function to combine the results of the various distance functions to obtain the final answer. The distance function and the combination function are key components of the memory-based reasoning technique.

Let us consider a simple example to observe how MBR works. This example is about predicting the last book read by new respondents based on a dataset of known responses. To keep the example simple, assume there are four recent bestsellers. The students surveyed have read these books and have also mentioned which they read last. The survey results are shown in Figure 17-13. Look at the first part of the figure. Here you see the scatterplot of known respondents. The second part of the figure contains the unknown respondents falling in place on the scatterplot. From where each unknown respondent falls on the scatterplot, you can determine the distance to the known respondents and then find the nearest neighbor. The nearest neighbor predicts the last book read by each unknown respondent.

Figure 17-13 Memory-based reasoning (scatterplots by age of students showing four groups of respondents for the bestsellers Timeline, The Greatest Generation, The Last Precinct, and The O'Reilly Factor, with each unknown respondent assigned to its nearest neighbor).

For solving a data mining problem using MBR, you are concerned with three critical issues:

1. Selecting the most suitable historical records to form the training or base dataset
2. Establishing the best way to compose the historical record
3. Determining the two essential functions, namely, the distance function and the combination function
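A minimal nearest-neighbor sketch shows the two essential functions named above at work. The training records, the Euclidean distance function, and the majority-vote combination function are assumptions chosen for illustration.

```python
from collections import Counter

def distance(a, b):
    # distance function: how far apart two records are across their variables
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def combine(labels):
    # combination function: majority vote among the neighbors' known answers
    return Counter(labels).most_common(1)[0][0]

def mbr_predict(training, new_record, k=3):
    # rank the training records by closeness to the incoming record, keep the k nearest
    neighbors = sorted(training, key=lambda row: distance(row[0], new_record))[:k]
    return combine([label for _, label in neighbors])

# hypothetical survey records: (age, hours read per week) -> last book read
training = [
    ((17, 4), "Timeline"),
    ((19, 5), "Timeline"),
    ((24, 2), "The Last Precinct"),
    ((26, 3), "The Last Precinct"),
    ((33, 6), "The Greatest Generation"),
    ((35, 7), "The Greatest Generation"),
]
print(mbr_predict(training, (18, 4)))   # expected to land near the Timeline readers
```

Notice that there is no model-building step at all: the training dataset itself is the "memory," and every prediction is made by looking up the most similar past instances.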
Link Analysis

This technique is extremely useful for finding patterns from relationships. If you look at the business world closely, you clearly notice all types of relationships. Airlines link cities together. Telephone calls connect people and establish relationships. Fax machines connect with one another. Physicians prescribing treatments have links to the patients. In a sale transaction at a supermarket, many items bought together in one trip are all linked together. You notice relationships everywhere.

The link analysis technique mines relationships and discovers knowledge. For example, if you look at the supermarket sale transactions for one day, why are skim milk and brown bread found in the same transaction about 80% of the time? Is there a strong relationship between the two products in the supermarket basket? If so, can these two products be promoted together? Are there more such combinations? How can we find such links or affinities?

Pursue another example, casually mentioned above. For a telephone company, finding out which residential customers have fax machines is a useful proposition. Why? If a residential customer uses a fax machine, then that customer may either want a second line or want to have some kind of upgrade. By analyzing the relationships between two phone numbers established by the calls, along with other stipulations, the desired information can be discovered. Link analysis algorithms discover such combinations.

Depending upon the types of knowledge discovery, link analysis techniques have three types of applications: associations discovery, sequential pattern discovery, and similar time sequence discovery. Let us briefly discuss each of these applications.

Associations Discovery. Associations are affinities between items. Association discovery algorithms find combinations where the presence of one item suggests the presence of another. When you apply these algorithms to the shopping transactions at a supermarket, they will uncover affinities among products that are likely to be purchased together. Association rules represent such affinities. The algorithms derive the association rules systematically and efficiently. Please see Figure 17-14 presenting an association rule and the annotated parts of the rule. The two parts, the support factor and the confidence factor, indicate the strength of the association. Rules with high support and confidence factor values are more valid, relevant, and useful.

Figure 17-14 An association rule: "A customer in a supermarket also buys milk in 65% of the cases whenever the customer buys bread, this happening for 20% of all purchases." The figure annotates the rule head, the rule body, the confidence factor (65%), and the support factor (20%).

Simplicity makes association discovery a popular data mining technique. There are only two factors to be interpreted, and even these tend to be intuitive. Because the technique essentially involves counting the combinations as the dataset is read repeatedly each time new dimensions are added, scaling does pose a problem.
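Support and confidence can be computed directly from the transactions. The following sketch counts item pairs in a handful of hypothetical baskets; production algorithms such as Apriori organize this counting far more efficiently over large datasets.

```python
from itertools import combinations
from collections import Counter

# hypothetical market baskets, one set of items per supermarket transaction
baskets = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "milk", "butter"},
]

def pair_rules(baskets, min_support=0.2, min_confidence=0.6):
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(pair for basket in baskets
                          for pair in combinations(sorted(basket), 2))
    rules = []
    for (a, b), together in pair_counts.items():
        support = together / n                             # share of all transactions containing both items
        for body, head in ((a, b), (b, a)):
            confidence = together / item_counts[body]      # how often the head appears given the body
            if support >= min_support and confidence >= min_confidence:
                rules.append((body, head, support, confidence))
    return rules

for body, head, s, c in pair_rules(baskets):
    print(f"{body} -> {head}: support {s:.0%}, confidence {c:.0%}")
```

In these baskets, bread and milk appear together in three of the five transactions, so the rule "bread implies milk" comes out with 60% support and 75% confidence, exactly the two numbers a tool would report for Figure 17-14.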
Sequential Pattern Discovery. As the name implies, these algorithms discover patterns where one set of items follows another specific set. Time plays a role in these patterns. When you select records for analysis, you must have date and time as data items to enable discovery of sequential patterns.

Let us say you want the algorithm to discover the buying sequence of products. The sale transactions form the dataset for the data mining operation. The data elements in the sale transaction may consist of date and time of transaction, products bought during the transaction, and the identification of the customer who bought the items. A sample set of these transactions and the results of applying the algorithm are shown in Figure 17-15. Notice the discovery of the sequential pattern. Also notice the support factor that gives an indication of the relevance of the association.

Figure 17-15 Sequential pattern discovery. A sample transaction data file is reduced to a product sequence per customer (for example, John Brown: Desktop PC, MP3 Player, Digital Camera), and patterns are reported with their support factors: with support above 40%, the patterns Desktop PC, MP3 Player, Digital Camera and Laptop PC, Digital Camera emerge; with support above 60%, the pattern Desktop PC, MP3 Player emerges.

Typical discoveries include sequential patterns of the following types:

• Purchase of a digital camera is followed by purchase of a color printer 60% of the time
• Purchase of a desktop PC is followed by purchase of a tape backup drive 65% of the time
• Purchase of window curtains is followed by purchase of living room furniture 50% of the time
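A simplified sketch, restricted to two-item patterns, shows how the support factor in Figure 17-15 can be counted from the customer sequences; the 40% minimum support threshold is an assumption taken from the example.

```python
from itertools import combinations

# customer purchase sequences in the spirit of Figure 17-15 (products in time order)
sequences = {
    "John Brown": ["Desktop PC", "MP3 Player", "Digital Camera"],
    "Cindy Silverman": ["Desktop PC", "MP3 Player", "Digital Camera", "Tape Backup Drive"],
    "Robert Stone": ["Laptop PC", "Digital Camera"],
    "Terry Goldsmith": ["Laptop PC", "Digital Camera"],
    "Richard McKeown": ["Desktop PC", "MP3 Player"],
}

def two_item_patterns(sequences, min_support=0.4):
    # count the customers whose sequence contains item a somewhere before item b
    n = len(sequences)
    counts = {}
    for seq in sequences.values():
        for pair in set(combinations(seq, 2)):   # combinations preserve the time ordering
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: cnt / n for pair, cnt in counts.items() if cnt / n >= min_support}

for (a, b), support in sorted(two_item_patterns(sequences).items(), key=lambda kv: -kv[1]):
    print(f"{a} -> {b}: support {support:.0%}")
```

Run against these five customers, the pair Desktop PC followed by MP3 Player is supported by three customers, giving the 60% support factor shown in the figure; the remaining qualifying pairs come in at 40%.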
Similar Time Sequence Discovery. This technique depends on the availability of time sequences. In the previous technique, the results indicate sequential events over time. This technique, however, finds a sequence of events and then comes up with other similar sequences of events. For example, in retail department stores, this data mining technique comes up with a second department that has a sales stream similar to the first. Finding similar sequential price movements of stocks is another application of this technique.

Neural Networks

Neural networks mimic the human brain by learning from a training dataset and applying the learning to generalize patterns for classification and prediction. These algorithms are effective when the data is shapeless and lacks any apparent pattern. The basic unit of an artificial neural network is modeled after the neurons in the brain. This unit is known as a node and is one of the two main structures of the neural network model. The other structure is the link, which corresponds to the connection between neurons in the brain. Please see Figure 17-16 illustrating the neural network model.

Figure 17-16 Neural network model (values for the input variables enter at the input nodes, weighted inputs pass along links from node to node, and the discovered value for the output variable emerges at the output node).

Let us consider a simple example to understand how a neural network makes a prediction. The neural network receives values of the variables or predictors at the input nodes. If there are 15 different predictors, then there are 15 input nodes. Weights may be applied to the predictors to condition them properly. Now please look at Figure 17-17 indicating the working of a neural network. There may be several inner layers operating on the predictors, and the values move from node to node until the discovered result is presented at the output node. The inner layers are also known as hidden layers because, as the input dataset runs through many iterations, the inner layers rehash the predictors over and over again.

Figure 17-17 How a neural network works. In this example of pre-approval for a Gold Credit Card upgrade, age 35 and income $75,000 are scaled to input values of 0.35 and 0.75, multiplied by weights of 0.9 and 1.0, and combined into an output value of 1.065; the upgrade is pre-approved if the output value exceeds 1.0.
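The arithmetic of Figure 17-17 can be reproduced in a few lines; the scaling of age and income to values between 0 and 1 and the single-layer structure are simplifications taken from the figure's example.

```python
def forward(inputs, weights, threshold=1.0):
    # weighted sum at the output node; a real network would add hidden layers
    # and a non-linear activation at each node, with weights learned from training data
    output = sum(value * weight for value, weight in zip(inputs, weights))
    return output, output > threshold

# Figure 17-17 example: age 35 and income $75,000 scaled to 0.35 and 0.75
inputs = [0.35, 0.75]
weights = [0.9, 1.0]

output, approved = forward(inputs, weights)
print(f"output value {output:.3f}, upgrade pre-approved: {approved}")   # 1.065, True
```

Training a neural network amounts to adjusting those weights, pass after pass over the training dataset, until the outputs line up with the known answers; prediction is then just this forward calculation applied to a new record.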
Genetic Algorithms

In a way, genetic algorithms have something in common with neural networks. This technique also has its basis in biology. It is said that evolution and natural selection promote the survival of the fittest. Over generations, the process propagates the genetic material in the fittest individuals from one generation to the next. Genetic algorithms apply the same principles to data mining. This technique uses a highly iterative process of selection, cross-over, and mutation operators to evolve successive generations of models. At each iteration, every model competes with every other by inheriting traits from previous ones, until only the most predictive model survives.
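Before turning to the coupon example below, here is a minimal sketch of that selection, cross-over, and mutation loop. The integer candidates and the fitness function are purely hypothetical stand-ins for whatever models a real tool would evolve.

```python
import random

def fitness(candidate):
    # purely hypothetical profit model: some interior value is best
    return -(candidate - 7) ** 2 + 50

def evolve(pop_size=20, generations=30, low=0, high=20):
    population = [random.randint(low, high) for _ in range(pop_size)]
    for _ in range(generations):
        # selection: keep only the fitter half of the current generation
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            # cross-over: combine traits of two surviving parents
            a, b = random.sample(survivors, 2)
            child = (a + b) // 2
            # mutation: occasionally perturb the child
            if random.random() < 0.2:
                child = min(high, max(low, child + random.choice([-2, -1, 1, 2])))
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)

print(evolve())   # converges toward the candidate with the highest hypothetical fitness
```

Each generation discards the weaker candidates and breeds new ones from the stronger, so after enough iterations the surviving answer is the one the fitness function rates best.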
Let us try to understand the evolution of successive generations in genetic algorithms by using a very popular example used by many authors. This is the problem to be solved: your company is doing a promotional mailing and wants to include free coupons in the mailing. Remember, this is a promotional mailing with the goal of increasing profits. At the same time, the promotional mailing must not produce the opposite result of lost revenue. This is the question: what is the optimum number of coupons to be placed in each mailer to maximize profits?

At first blush, it looks like mailing out as many coupons as possible might be the solution. Will this not enable the customers to use all the available coupons and maximize profits? However, some other factors seem to complicate the problem. First, the more coupons in the mailer, the higher the postal costs are going to be. The increased mailing [...]

[...] levels of data. Notice the detail and summary data structures. Think further how the data structures are implemented in physical storage as files, blocks, and records. Figure 18-6 shows the data staging area (data extract flat files, load image flat files) and the data warehouse repository (relational database data files, index files, and partitioned files). [...]

[...] generate large volumes of detailed transaction data. Such data is suitable for data mining. Data mining applications at banks are quite varied. Fraud detection, risk assessment of potential customers, trend analysis, and direct marketing are the primary data mining applications at banks. In the financial area, requirements for forecasting dominate. Forecasting of stock prices and commodity prices with a high [...]

[...] mining, a sound and solid data warehouse will put the data mining operation on a strong foundation. As mentioned earlier, data mining techniques produce good results when large volumes of data are available. Almost all the algorithms need data at the lowest grain. Consider having data at the detailed level in your data warehouse. Another important point refers to the quality of the data. Data mining is about [...]

[...] means exhaustive, but it covers the essential points.

Data Access. The data mining tool must be able to access data sources such as the data warehouse and quickly bring over the required datasets to its environment. On many occasions you may need data from other sources to augment the data extracted from the data warehouse. The tool must be capable of reading other data sources and input formats.

Data Selection. [...] selecting and extracting data for mining, the tool must be able to perform its operations according to a variety of criteria. Selection abilities must include filtering out of unwanted data and deriving new data items from existing ones.

Sensitivity to Data Quality. Because of its importance, data quality is worth mentioning again. The data mining tool must be sensitive to the quality [...]

[...] the activities that make administration easy. For instance, ease of administration includes methods for proper arrangement of table rows in storage so that frequent reorganization is avoided. Another area for ease of administration is in the backup and recovery of database tables. Review the various data warehouse administration tasks. Make it easy for administration whenever it comes to working with [...]

[...] banking application. (Matching exercise, items A through J: reveals reasons for the discovery; neural networks; distance function; feeds data for mining; data-driven; fraud detection; user-driven; forms groups; highly iterative; associations discovery.)

2. As a data mining consultant, you are hired by a large commercial bank that provides many financial services. The bank already has a data warehouse that it rolled out two years ago [...]

[...] techniques for the data warehouse environment
• Review and summarize all performance enhancement options

As an IT professional, you are familiar with logical and physical models. You have probably worked with the transformation of a logical model into a physical model. You also know that completing the physical model has to be tied to the details of the platform, the database software, hardware, and any third-party [...]

[...] visualization capabilities.

Extensibility. The tool architecture must be able to integrate with the data warehouse administration and other functions such as data extraction and metadata management.

Performance [...]