for easy refresh as data changes. PivotTables and PivotCharts provide a flexible analysis environment for both relational and Analysis Services data that queries the current database contents for every update. SQL Server Data Mining Add-ins for Office 2007 make a host of sophisticated data mining capabilities available within the familiar Excel environment. Use the Data Preparation features to explore and clean data, and the Table Analysis Tools to perform a number of common analyses with minimal effort. Excel even offers an alternative to Visual Studio as a client for data mining in Analysis Services. Almost every organization can benefit from better data availability and analysis, and Excel is a great place to start.

Data Mining with Analysis Services

IN THIS CHAPTER

Overview of the data mining process
Creating mining structures and models
Evaluating model accuracy
Deploying data mining functionality in applications
Mining algorithms and viewers
Mining integration with Analysis Services cubes

Many business questions can be answered directly by querying a database — for example, "What is the most popular page on our website?" or "Who are our top customers?" Other, often more important, questions require deeper exploration — for example, the most popular paths through the website or the common characteristics of top customers. Data mining provides the tools to answer such non-obvious questions.

The term data mining has suffered from a great deal of misuse. One favorite anecdote is the marketing person who intended to "mine" data in a spreadsheet by staring at it until inspiration struck. In this book, data mining is not something performed by intuition, direct query, or simple statistics. Instead, it is the algorithmic discovery of non-obvious information from large quantities of data.

Analysis Services implements algorithms to extract information addressing several categories of questions:

■ Segmentation: Groups items with similar characteristics. For example, develop profiles of top customers or spot suspect values on a data entry page.
■ Classification: Places items into categories. For example, determine which customers are likely to respond to a marketing campaign or which e-mails are likely to be spam.
■ Association: Sometimes called market basket analysis, this determines which items tend to occur together. For example, which web pages are normally viewed together on the site, or "Customers who bought this book also bought…"
■ Estimation: Estimates a value. For example, estimating revenue from a customer or the life span of a piece of equipment.
■ Forecasting: Predicts what a time series will look like in the future. For example, when will we run out of disk space, or what revenue do we expect in the upcoming quarter?
■ Sequence analysis: Determines which items tend to occur together in a specific order. For example, what are the most common paths through our website? Or, in what order are products normally purchased?

These categories are helpful for thinking about how data mining can be used, but with increased comfort level and experience, many other applications are possible.
The Data Mining Process

A traditional use of data mining is to train a data mining model using data for which an outcome is already known, and then use that model to predict the outcome of new data as it becomes available. This use of data mining requires several steps, only some of which happen within Analysis Services:

■ Business and data understanding: Understand the questions that are important and the data available to answer those questions. Insights gained must be relevant to business goals to be of use. Data must be of acceptable quality and relevance to obtain reliable answers.

■ Prepare data: The effort to get data ready for mining can range from simple to painstaking, depending on the situation. Some of the tasks to consider include the following (a brief sketch follows this list):
   ■ Eliminate rows of low data quality. Here, the measure of quality is domain specific, but it may include too small an underlying sample size, values outside of expected norms, or failing any test that proves the row describes an impossible or highly improbable case.
   ■ General cleaning by scaling, formatting, and so on; and by eliminating duplicates, invalid values, or inconsistent values.
   ■ Analysis Services accepts a single primary case table and, optionally, one or more child nested tables. If the source data is spread among several tables, then denormalization by creating views or preprocessing will be required.
   ■ Erratic time series data may benefit from smoothing. Smoothing algorithms remove the dramatic variations from noisy data at the cost of accuracy, so experimentation may be necessary to choose an algorithm that does not adversely impact the data mining outcome.
   ■ Derived attributes can be useful in the modeling process, typically either calculating a value from other attributes (e.g., Profit = Income − Cost) or simplifying the range of a complex domain (e.g., mapping numeric survey responses to High, Medium, or Low). Some types of preparation can be accomplished within the Analysis Services data source view using named queries and named calculations. When possible, this is highly recommended, as it avoids reprocessing data sets if changes become necessary.
   ■ Finally, it is necessary to split the prepared data into two data sets: a training data set that is used to set up the model, and a testing data set that is used to evaluate the model’s accuracy. Testing data can be held out either in the mining structure itself or during the data preparation process. The Integration Services Row Sampling and Percentage Sampling transforms are useful for randomly splitting data, typically saving 20 to 30 percent of rows for testing.

■ Model: Analysis Services models are built by first defining a data mining structure that specifies the tables to be used as input. Then, data mining models (different algorithms) are added to the structure. Finally, all the models within the structure are trained simultaneously using the training data.

■ Evaluate: Evaluating the accuracy and usefulness of the candidate mining models is simplified by Analysis Services’ Mining Accuracy Chart. Use the testing data set to understand the expected accuracy of each model and compare it to business needs.

■ Deploy: Integrate prediction queries into applications to predict the outcomes of interest.

For a more detailed description of the data mining process, see www.crisp-dm.org.
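Several of the preparation tasks above can be expressed directly in SQL before the data ever reaches Analysis Services. The following is a minimal T-SQL sketch, not taken from the chapter's examples: the table dbo.CustomerSource, its columns, and the quality rules are hypothetical and would need to be replaced with names and rules from the actual source system.

```sql
-- Hypothetical preparation view: derive an attribute, simplify a numeric
-- domain into discrete bands, and screen out improbable rows.
CREATE VIEW dbo.vCustomerTraining AS
SELECT
    CustomerID,
    Gender,
    Occupation,
    Income - Cost AS Profit,                  -- derived attribute
    CASE WHEN SurveyScore >= 8 THEN 'High'
         WHEN SurveyScore >= 4 THEN 'Medium'
         ELSE 'Low'
    END AS SurveyBand                         -- simplified domain
FROM dbo.CustomerSource
WHERE Income IS NOT NULL
  AND Cost IS NOT NULL
  AND Age BETWEEN 0 AND 120;                  -- drop impossible or improbable cases
```

The same calculations could instead be defined as named calculations or a named query in the Analysis Services data source view, which keeps the preparation logic inside the project, as recommended above.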
While this process is typical of data mining tasks, it does not cover every situation. Occasionally, exploring a data set is an end in itself, providing a better understanding of the data and its relationships. The process in this case may just iterate between prepare/model/evaluate cycles. At the other end of the spectrum, an application may build, train, and query a model to accomplish a task, such as identifying outlier rows in a data set. Regardless of the situation, understanding this typical process will aid in building appropriate adaptations.

Modeling with Analysis Services

Open an Analysis Services project within Business Intelligence Development Studio to create a data mining structure. When deployed, the Analysis Services project will create an Analysis Services database on the target server. Often, data mining structures are deployed in conjunction with related cubes in the same database.

Begin the modeling process by telling Analysis Services where the training and testing data reside:

■ Define data source(s) that reference the location of data to be used in modeling.
■ Create data source views that include all training tables. When nested tables are used, the data source view must show the relationship between the case and nested tables.

For information on creating and managing data sources and data source views, see Chapter 71, "Building Multidimensional Cubes with Analysis Services."

Data Mining Wizard

The Data Mining Wizard steps through the process of defining a new data mining structure and optionally the first model within that structure. Right-click on the Mining Structures node within the Solution Explorer and choose New Mining Structure to start the wizard. The wizard consists of several pages:

■ Select the Definition Method: Options include relational (from an existing relational database or data warehouse) or cube (from an existing cube) source data. For this example, choose relational. (See the section "OLAP Integration" later in this chapter for differences between relational-based and cube-based mining structures.)
■ Create the Data Mining Structure: Choose the algorithm to use in the structure’s first mining model (see the "Algorithms" section in this chapter for common algorithm usage). Alternately, a mining structure can be created with no models, and one or more models can be added to the structure later.
■ Select Data Source View: Choose the data source view containing the source data table(s).
■ Specify Table Types: Choose the case table containing the source data and any associated nested tables. Nested tables always have one-to-many relationships with the case table, such as a list of orders as the case table and associated order line items in the nested table.
■ Specify the Training Data: Categorize columns by their use in the mining structure. When a column is not included in any category, it is omitted from the structure. Categories are as follows:
   ■ Key: Choose the columns that uniquely identify a row in the training data. By default, the primary key shown in the data source view will be marked as the key.
   ■ Input: Mark each column that may be used in prediction — generally this includes the predictable columns as well.
   Once the predictable columns have been identified, the Suggest button can aid in selecting inputs by scoring columns for relevance based on a sample of the training data. Take care, however, to avoid inputs with values that are unlikely to occur again as input to a trained model. For example, a customer ID, name, or address might be very effective at training a model, but once the model is built to look for a specific ID or address, it is very unlikely new customers will ever match those values. Conversely, gender and occupation values are very likely to reappear in new customer records.
   ■ Predictable: Identify all columns the model should be able to predict.
■ Specify Columns’ Content and Data Type: Review and adjust the data type (Boolean, Date, Double, Long, Text) as needed. Review and adjust the content type as well; pressing the Detect button to calculate continuous versus discrete for numeric data types may help. Available content types include the following:
   ■ Key: Contains a value that, either alone or with other keys, uniquely identifies a row in the training table.
   ■ Key Sequence: Acts as a key and provides order to the rows in a table. It is used to order rows for the Sequence Clustering algorithm.
   ■ Key Time: Acts as a key and provides order to the rows in a table based on a time scale. It is used to order rows for the Time Series algorithm.
   ■ Continuous: Continuous numeric data — often the result of some calculation or measurement, such as age, height, or price.
   ■ Discrete: Data that can be thought of as a choice from a list, such as occupation, model, or shipping method.
   ■ Discretized: Analysis Services will transform a continuous column into a set of discrete buckets, such as ages 0–10, 11–20, and so on. In addition to choosing this option, other column properties must be set once the wizard is complete: open the mining structure, select the column, and then set the DiscretizationBucketCount and DiscretizationMethod properties to direct how the "bucketization" will be performed.
   ■ Ordered: Defines an ordering on the training data but without assigning significance to the values used to order. For example, if values of 5 and 10 are used to order two rows, then 10 simply comes after 5; it is not "twice as good" as 5.
   ■ Cyclical: Similar to ordered data but repeats values, thus defining a cycle in the data, such as day of month or month of quarter. This enables the mining model to account for cycles in the data, such as sales peaks at the end of a quarter or annually during the holidays.
■ Create Testing Set: In SQL Server 2008, the mining structure can hold both the training and the testing data directly, instead of manually splitting the data into separate tables. Specify the percentage or number of rows to be held out for testing models in this structure if testing data is included in the source table(s).
■ Completing the Wizard: Provide names for the overall mining structure and the first mining model within that structure. Select Allow Drill Thru to enable the direct examination of training cases from within the data mining viewers.

Once the wizard finishes, the new mining structure with a single mining model is created, and the new structure is opened in the Data Mining Designer. The initial Designer view, Mining Structure, enables columns to be added to or removed from the structure, and column properties, such as Content (type) or DiscretizationMethod, to be modified.
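The wizard generates the mining structure for you, but the same kind of definition can also be expressed in DMX, the data mining query language for Analysis Services. The statement below is only a sketch of what such a structure might look like for a hypothetical customer table; the structure name, columns, and holdout percentage are illustrative assumptions, not part of any sample shipped with SQL Server.

```sql
-- DMX sketch (hypothetical names): a key column, discrete and continuous
-- inputs, and 30 percent of the rows held out for testing.
CREATE MINING STRUCTURE [Customer Profile]
(
    [Customer Key]  LONG KEY,
    [Gender]        TEXT DISCRETE,
    [Occupation]    TEXT DISCRETE,
    [Age]           LONG CONTINUOUS,
    [Bike Buyer]    LONG DISCRETE
)
WITH HOLDOUT (30 PERCENT)
```

The content types in the column list correspond to the choices made on the wizard's Specify Columns' Content and Data Type page, and the HOLDOUT clause corresponds to the Create Testing Set page.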
Mining Models view

The Mining Models view of the Data Mining Designer enables different data mining algorithms to be configured on the data defined by the mining structure. Add new models as follows (see Figure 76-1):

FIGURE 76-1 Adding a new model to an existing structure

1. Right-click the structure/model matrix pane and choose New Mining Model.
2. Supply a name for the model.
3. Select the desired algorithm and click OK.

Depending on the structure definition, not all algorithms will be available — for example, the Sequence Clustering algorithm requires that a Key Sequence column be defined, while the Time Series algorithm requires a Key Time column to be defined. In addition, not every algorithm will use each column in the same way — for example, some algorithms ignore continuous input columns (consider using discretization on these columns).

SQL Server 2008 allows filters to be placed on models, which can be useful when training models specific to a subset of the source data. For example, different customer groups can be targeted by training filtered models in a single mining structure. Right-click on a model and choose Set Model Filter to apply a filter to a model. Once set, the current filter is viewable in the model’s properties.

In addition to the optional model filter, each mining model has both properties and algorithm parameters. Select a model (column) to view and change the properties common to all algorithms in the Properties pane, including Name, Description, and AllowDrillThrough. Right-click on a model and choose Set Algorithm Parameters to change an algorithm’s default settings.

Once both the structure and model definitions are in place, the structure must be deployed to the target server to process and train the models. Deploying a model consists of two parts:

1. During the build phase, the structure definition (or changes to the definition as appropriate) is sent to the target Analysis Services server. Examine the progress of the build in the Output pane.
2. During the process phase, the Analysis Services server queries the source data, caches that data in the mining structure, and trains the models with all the data that has not been either filtered out or held out for testing.

Before the first time a project is deployed, set the target server by right-clicking on the project in the Solution Explorer pane containing the mining structure and choosing Properties. Then select the Deployment topic and enter the appropriate server name, adjusting the target database name at the same time (deploying creates an Analysis Services database named, by default, after the project).

Deploy the structure by choosing either Process Model or Process Mining Structure and All Models from the context menu. The same options are available from the Mining Model menu as well. After processing, the Mining Model Viewer tab contains processing results; here, one or more viewers are available depending on which models are included in the structure. The algorithm-specific viewers assist in understanding the rules and relationships discovered by the models (see the "Algorithms" section later in this chapter).
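The designer steps above also have DMX equivalents, which can be useful for scripting repeatable builds. The following sketch reuses the hypothetical [Customer Profile] structure from the earlier example; the model name, filter expression, data source name, and source query are all assumptions made for illustration.

```sql
-- DMX sketch (hypothetical names): add a filtered decision tree model
-- to an existing structure, then train the structure and all of its models.
ALTER MINING STRUCTURE [Customer Profile]
ADD MINING MODEL [Professional Buyers DT]
(
    [Customer Key],
    [Gender],
    [Occupation],
    [Age],
    [Bike Buyer] PREDICT
)
USING Microsoft_Decision_Trees
WITH DRILLTHROUGH,
FILTER([Occupation] = 'Professional')

-- Processing reads the source rows and trains every model in the structure,
-- skipping rows held out for testing or excluded by a model's filter.
INSERT INTO MINING STRUCTURE [Customer Profile]
(
    [Customer Key], [Gender], [Occupation], [Age], [Bike Buyer]
)
OPENQUERY([My Data Source],
    'SELECT CustomerKey, Gender, Occupation, Age, BikeBuyer
       FROM dbo.vCustomerTraining')
```

Whether a structure is built in the designer or by script, the deployed result is the same, so the evaluation tools described next apply either way.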
Model evaluation

Evaluate the trained models to determine which model predicts the outcome most reliably, and to decide whether the accuracy will be adequate to meet business goals. The Mining Accuracy Chart view provides tools for performing the evaluation. The charts visible within this view are enabled by supplying data for testing under the Input Selection tab. Choose one of three sources:

■ Use mining model test cases: Uses test data held out in the mining structure, but applies any model filters in selecting data for each model.
■ Use mining structure test cases: Uses test data held out in the mining structure, ignoring any model filters.
■ Specify a different data set: Allows the selection and mapping of an external table to supply test data. After selecting this option, press the ellipsis button to display the Specify Column Mapping dialog. Then press the Select Case Table button on the right-hand table and choose the table containing the test data. The joins between the selected table and the mining structure are mapped automatically for matching column names, or they can be mapped manually by drag-and-drop when a match is not found. Verify that each non-key column in the mining structure participates in a join.

If the value being predicted is discrete, then the Input Selection tab also allows choosing a particular outcome for evaluation. If a Predict Value is not selected, then accuracy for all outcomes is evaluated.

Lift charts and scatter plots

Once the source data and any Predict Value have been specified, switch to the Lift Chart tab and verify that Lift Chart (Scatter Plot for continuous outcomes) is selected from the Chart Type list box (see Figure 76-2). Because the source data contains the predicted column(s), the lift chart can compare each model’s prediction against the actual outcome. The lift chart plots this information on Target Population % (percent of cases correct) versus Overall Population % (percent of cases tested) axes, so when 50 percent of the population has been checked, the perfect model will have predicted 50 percent correctly. In fact, the chart automatically includes two useful reference lines: the Ideal Model, which indicates the best possible performance, and the Random Guess, which indicates how often randomly assigned outcomes happen to be correct.

The profit chart extends the lift chart and aids in calculating the maximum return from marketing campaigns and similar efforts. Press the Settings button to specify the number of prospects, the fixed and per-case cost, and the expected return from a successfully identified case; then choose Profit Chart from the Chart Type list box. The resulting chart plots profit versus the percent of the population included, offering a guide as to how much of the population should be included in the effort, either by maximizing profit or by locating a point of diminishing returns.

Classification matrix

The simplest view of model accuracy is offered by the Classification Matrix tab, which creates one table for each model, with predicted outcomes listed down the left side of the table and actual values across the top, similar to the example shown in Table 76-1. This example shows that for red cases, this model correctly predicted red for 95 and incorrectly predicted blue for 37. Likewise, for cases that were actually blue, the model correctly predicted blue 104 times while incorrectly predicting red 21 times.
TABLE 76-1 Example Classification Matrix

Predicted    Red (Actual)    Blue (Actual)
Red          95              21
Blue         37              104

The classification matrix is not available for predicting continuous outcomes.

FIGURE 76-2 Lift Chart tab

Cross validation

Cross validation is a very effective technique for evaluating a model for stability and how well it will generalize to unseen cases. The concept is to partition the available data into some number of equal-sized buckets called folds, then train the model on all but one of those folds and test with the remaining fold, repeating until each of the folds has been used for testing. For example, if three folds were selected, the model would be trained on folds 2 and 3 and tested with fold 1, then trained on folds 1 and 3 and tested on fold 2, and finally trained on folds 1 and 2 and tested on fold 3.

Switch to the Cross Validation tab and specify the parameters for the evaluation:

■ Fold Count: The number of partitions into which the data will be placed.
■ Max Cases: The number of cases from which the folds will be constructed. For example, 1,000 cases and 10 folds will result in approximately 100 cases per fold. Because of the large amount of processing required to perform cross validation, it is often useful to limit the number of cases. Setting this value to 0 results in all cases being used.
■ Target Attribute and State: The prediction to validate.
■ Target Threshold: Sets the minimum probability required before assuming a positive result. For example, if you were identifying customers for an expensive marketing promotion, a minimum threshold of 80 percent likely to purchase could be set to target only the best prospects. Knowing that this threshold will be used enables a more realistic evaluation of the model.

Once the cross validation has run, a report like the one shown in Figure 76-3 displays the outcome for each fold across a number of different measures. In addition to how well a model performs on each line item, the standard deviation of the results of each measure should be relatively small. If the variation between folds is large, it is an indication that the model will not generalize well in practical use.

FIGURE 76-3 Cross Validation tab

Troubleshooting models

Models seldom approach perfection in the real world. If these evaluation techniques show a model falling short of your needs, then consider these common problems:

■ A non-random split of data into training and test data sets. If the split method used was based on a random algorithm, rerun the random algorithm to obtain a more random result.