■ Input columns are too case specific (e.g., IDs, names, etc.). Adjust the mining structure to ignore data items containing values that occur in the training data but then never reappear for test or production data.

■ Too few rows (cases) in the training data set to accurately characterize the population of cases. Look for additional sources of data for best results. If additional data is not available, then better results may be obtained by limiting the special cases considered by an algorithm (e.g., increasing the MINIMUM_SUPPORT parameter).

■ If all models are closer to the Random Guess line than the Ideal Model line, then the input data does not correlate with the outcome being predicted.

Note that some algorithms, such as Time Series, do not support the Mining Accuracy Chart view at all.

Regardless of the tools available within the development environment, it is important to perform an evaluation of the trained model using test data held in reserve for that purpose. Then, modify the data and model definitions until the results meet the business goals at hand.

Deploying

Several methods are available for interfacing applications with data mining functionality:

■ Directly constructing XMLA, communicating with Analysis Services via SOAP. This exposes all functionality at the price of in-depth programming.

■ Analysis Management Objects (AMO) provides an environment for creating and managing mining structures and other meta-data, but not for prediction queries.

■ The Data Mining Extensions (DMX) language supports most model creation and training tasks and has a robust prediction query capability. DMX can be sent to Analysis Services via the following:
  ■ ADOMD.NET for managed (.NET) languages
  ■ OLE DB for C++ code
  ■ ADO for other languages

DMX is a SQL-like language modified to accommodate mining structures and tasks. For purposes of performing prediction queries against a trained model, the primary language feature is the prediction join. As the following code example shows, the prediction join relates a mining model and a set of data to be predicted (cases). Because the DMX query is issued against the Analysis Services database, the model [TM Decision Tree] can be directly referenced, while the cases must be gathered via an OPENQUERY call against the relational database. The corresponding columns are matched in the ON clause like a standard relational join, and the WHERE and ORDER BY clauses function as expected. DMX also adds a number of mining-specific functions, such as the Predict and PredictProbability functions shown here, which return the most likely outcome and the probability of that outcome, respectively.
Overall, this example returns a list of IDs, names, and probabilities for prospects who are more than 60 percent likely to purchase a bike, sorted by descending probability:

SELECT t.ProspectAlternateKey, t.FirstName, t.LastName,
    PredictProbability([TM Decision Tree].[Bike Buyer]) AS Prob
FROM [TM Decision Tree]
PREDICTION JOIN
  OPENQUERY([Adventure Works DW],
    'SELECT ProspectAlternateKey, FirstName, LastName, MaritalStatus,
        Gender, YearlyIncome, TotalChildren, NumberChildrenAtHome,
        Education, Occupation, HouseOwnerFlag, NumberCarsOwned,
        StateProvinceCode
     FROM dbo.ProspectiveBuyer;') AS t
ON  [TM Decision Tree].[Marital Status] = t.MaritalStatus
AND [TM Decision Tree].Gender = t.Gender
AND [TM Decision Tree].[Yearly Income] = t.YearlyIncome
AND [TM Decision Tree].[Total Children] = t.TotalChildren
AND [TM Decision Tree].[Number Children At Home] = t.NumberChildrenAtHome
AND [TM Decision Tree].Education = t.Education
AND [TM Decision Tree].Occupation = t.Occupation
AND [TM Decision Tree].[House Owner Flag] = t.HouseOwnerFlag
AND [TM Decision Tree].[Number Cars Owned] = t.NumberCarsOwned
AND [TM Decision Tree].Region = t.StateProvinceCode
WHERE PredictProbability([TM Decision Tree].[Bike Buyer]) > 0.60
  AND Predict([TM Decision Tree].[Bike Buyer]) = 1
ORDER BY PredictProbability([TM Decision Tree].[Bike Buyer]) DESC

Another useful form of the prediction join is a singleton query, whereby data is provided directly by the application instead of read from a relational table, as shown in the next example. Because the column names exactly match those of the mining model, a NATURAL PREDICTION JOIN is used, which does not require an ON clause. This example returns the probability that the listed case will purchase a bike (i.e., [Bike Buyer] = 1):

SELECT PredictProbability([TM Decision Tree].[Bike Buyer], 1)
FROM [TM Decision Tree]
NATURAL PREDICTION JOIN
  (SELECT 47 AS [Age],
          '2-5 Miles' AS [Commute Distance],
          'Graduate Degree' AS [Education],
          'M' AS [Gender],
          '1' AS [House Owner Flag],
          'M' AS [Marital Status],
          2 AS [Number Cars Owned],
          0 AS [Number Children At Home],
          'Professional' AS [Occupation],
          'North America' AS [Region],
          0 AS [Total Children],
          80000 AS [Yearly Income]) AS t

Business Intelligence Development Studio aids in the construction of DMX queries via the Query Builder within the mining model prediction view. Just like the Mining Accuracy Chart, select the model and case table to be queried, or alternatively press the singleton button in the toolbar to specify values. Specify SELECT columns and prediction functions in the grid at the bottom. SQL Server Management Studio also offers a DMX query type with meta-data panes for drag-and-drop access to mining structure column names and prediction functions.
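Several prediction functions can also be combined in a single query. As a minimal sketch (not from the chapter's examples, but built on the same trained [TM Decision Tree] model and the same singleton case shown above), the following query returns the predicted outcome together with its probability and the number of training cases behind that prediction; PredictSupport is described with the other prediction functions below:

-- Combine several prediction functions for one supplied case
SELECT
    Predict([TM Decision Tree].[Bike Buyer])            AS Outcome,
    PredictProbability([TM Decision Tree].[Bike Buyer]) AS Prob,
    PredictSupport([TM Decision Tree].[Bike Buyer])     AS TrainingCases
FROM [TM Decision Tree]
NATURAL PREDICTION JOIN
  (SELECT 47 AS [Age],
          '2-5 Miles' AS [Commute Distance],
          'Graduate Degree' AS [Education],
          'M' AS [Gender],
          '1' AS [House Owner Flag],
          'M' AS [Marital Status],
          2 AS [Number Cars Owned],
          0 AS [Number Children At Home],
          'Professional' AS [Occupation],
          'North America' AS [Region],
          0 AS [Total Children],
          80000 AS [Yearly Income]) AS t

The same pattern works with the relational PREDICTION JOIN form shown earlier; each function simply takes the predictable column as its argument.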
Numerous prediction functions are available, including the following:

■ Predict: Returns the expected outcome for a predictable column

■ PredictProbability: Returns the probability (between 0 and 1) of the expected outcome, or of a specific outcome if one is specified

■ PredictSupport: Returns the number of training cases on which the expected outcome is based, or on which a specific outcome is based if one is specified

■ PredictHistogram: Returns a nested table with all possible outcomes for a given case, listing probability, support, and other information for each outcome

■ Cluster: Returns the cluster to which a case is assigned (clustering algorithm specific)

■ ClusterProbability: Returns the probability that the case belongs to a given cluster (clustering algorithm specific)

■ PredictSequence: Predicts the next values in a sequence (sequence clustering algorithm specific)

■ PredictAssociation: Predicts associative membership (association algorithm specific)

■ PredictTimeSeries: Predicts future values in a time series (time series algorithm specific). Like PredictHistogram, this function returns a nested table.

Algorithms

When working with data mining, it is useful to understand mining algorithm basics and when to apply each algorithm. Table 76-2 summarizes common algorithm usage for the problem categories presented at the beginning of this chapter.

TABLE 76-2  Common Mining Algorithm Usage

Problem Type        Primary Algorithms
Segmentation        Clustering, Sequence Clustering
Classification      Decision Trees, Naive Bayes, Neural Network, Logistic Regression
Association         Association Rules, Decision Trees
Estimation          Decision Trees, Linear Regression, Logistic Regression, Neural Network
Forecasting         Time Series
Sequence Analysis   Sequence Clustering

These usage guidelines are useful as an orientation, but not every data mining problem falls neatly into one of these types, and other algorithms will work for several of these problem types. Fortunately, with evaluation tools such as the lift chart, it's usually simple to identify which algorithm provides the best results for a given problem.

Decision tree

This algorithm is the most accurate for many problems. It operates by building a decision tree beginning with the All node, corresponding to all the training cases (see Figure 76-4). Then, an attribute is chosen that best splits those cases into groups, and each of those groups is examined for an attribute that best splits those cases, and so on. The goal is to generate leaf nodes with a single predictable outcome. For example, if the goal is to identify who will purchase a bike, then leaf nodes should contain cases that are either bike buyers or not bike buyers, but no combinations (or as close to that goal as possible).

FIGURE 76-4  Decision Tree Viewer

The Decision Tree Viewer shown in Figure 76-4 graphically displays the resulting tree. Age is the first attribute chosen in this example, splitting cases into groups such as under 35, 35 to 42, and so on. For the under-35 crowd, Number Cars Owned was chosen to further split the cases, while Commute Distance was chosen for the 56 to 70 cases. The Mining Legend pane displays the details of any selected node, including how the cases break out by the predictable variable (in this case, 796 buyers and 1,538 non-buyers), both in count and probability.
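The node details shown in the Mining Legend can also be retrieved programmatically with a DMX content query. As a minimal sketch (assuming the trained [TM Decision Tree] model used throughout this chapter), the following query lists every node of the tree with its caption, the number of training cases it covers, and its probability; the per-outcome breakout that the Mining Legend displays is carried in the nested NODE_DISTRIBUTION column, which can be added to the select list if needed:

-- Browse the structure of the trained decision tree
SELECT NODE_CAPTION, NODE_SUPPORT, NODE_PROBABILITY
FROM [TM Decision Tree].CONTENT

This form is handy when node statistics need to flow into a report or application rather than be read from the viewer.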
Many more node levels can be expanded using the Show Level control in the toolbar or the expansion controls (+/-) on each node. Note that much of the tree is not expanded in this figure due to space restrictions.

The Dependency Network Viewer is also available for decision trees, displaying both input and predictable columns as nodes, with arrows indicating what predicts what. Move the slider to the bottom to see only the most significant predictions. Click on a node to highlight its relationships.

Linear regression

The linear regression algorithm is implemented as a variant of decision trees and is a good choice for continuous data that relates more or less linearly. The result of the regression is an equation of the form

Y = B0 + A1*(X1 + B1) + A2*(X2 + B2) + ...

where Y is the column being predicted, the Xi are the input columns, and the Ai and Bi are constants determined by the regression. Because this algorithm is a special case of decision trees, it shares the same mining viewers. While, by definition, the Tree Viewer will show a single All node, the Mining Legend pane displays the prediction equation. The equation can either be used directly or queried in the mining model via the Predict function. The Dependency Network Viewer provides a graphical interpretation of the weights used in the equation.

Clustering

The clustering algorithm functions by gathering similar cases together into groups called clusters and then iteratively refining the cluster definition until no further improvement can be gained. This approach makes clustering uniquely suited for segmentation/profiling of populations. Several viewers display data from the finished model:

■ Cluster Diagram: This viewer displays each cluster as a shaded node with connecting lines between similar clusters; the darker the line, the more similar the clusters. Move the slider to the bottom to see only lines connecting the most similar clusters. Nodes are shaded darker to represent more cases. By default, the cases are counted from the entire population, but changing the Shading Variable and State pull-downs specifies shading to be based on particular variable values (e.g., which clusters contain homeowners).

■ Cluster Profiles: Unlike node shading in the Cluster Diagram Viewer, where one variable value can be examined at a time, the Cluster Profiles Viewer shows all variables and clusters in a single matrix. Each cell of the matrix is a graphical representation of that variable's distribution in the given cluster (see Figure 76-5). Discrete variables are shown as stacked bars describing how many cases contain each of the possible variable values. Continuous variables are shown as diamond charts, with each diamond centered on the mean (average) value for cases in that cluster, while the top and bottom of the diamond are the mean plus and minus the standard deviation, respectively. Thus, the taller the diamond, the less uniform the variable values in that cluster. Click on a cell (chart) to see the full distribution for a cluster/variable combination in the Mining Legend, or hover over a cell for the same information in a tooltip. In Figure 76-5, the tooltip displayed shows the full population's occupation distribution, while the Mining Legend shows Cluster 3's total children distribution.
■ Cluster Characteristics: This view displays the list of characteristics that make up a cluster and the probability that each characteristic will appear.

■ Cluster Discrimination: Similar to the Characteristics Viewer, this shows which characteristics favor one cluster versus another. It also enables the comparison of a cluster to its own complement, clearly showing what is and is not in a given cluster.

Once you gain a better understanding of the clusters for a given model, it is often useful to rename each cluster to something more descriptive than the default "Cluster n." From within either the Diagram or Profiles Viewer, right-click on a cluster and choose Rename Cluster to give it a new name.

FIGURE 76-5  Cluster Profiles Viewer

Sequence clustering

As the name implies, this algorithm still gathers cases together into clusters, but based on a sequence of events or items, rather than on case attributes. For example, the sequence of web pages visited during user sessions can be used to define the most common paths through that website. The nature of this algorithm requires input data with a nested table, whereby the parent row is the session or order (e.g., shopping cart ID) and the nested table contains the sequence of events during that session (e.g., order line items). In addition, the nested table's key column must be marked as a Key Sequence content type in the mining structure.

Once the model is trained, the same four cluster viewers described above are available to describe the characteristics of each cluster. In addition, the State Transition Viewer displays transitions between two items (e.g., a pair of web pages), with the associated probability of that transition happening. Move the slider to the bottom to see only the most likely transitions. Select a node to highlight the possible transitions from that item to its possible successors. The short arrows that don't connect to a second node denote a state that can be its own successor.

Neural Network

This famous algorithm is generally slower than other alternatives, but often handles more complex situations. The network is built using input, hidden (middle), and output layers of neurons, whereby the output of each layer becomes the input of the next layer. Each neuron accepts inputs that are combined using weighted functions that determine the output. Training the network consists of determining the weights for each neuron.

The Neural Network Viewer presents a list of characteristics (variable/value combinations) and how those characteristics favor given outputs (outcomes). Choose the two outcomes being compared in the Output area at the upper right (see Figure 76-6). Leaving the Input area in the upper left blank compares characteristics for the entire population, whereas specifying a combination of input values allows a portion of the population to be explored. For example, Figure 76-6 displays the characteristics that affect the buying decisions of adults less than 36 years of age with no children.

FIGURE 76-6  Neural Network Viewer

Logistic regression

Logistic regression is a special case of the neural network algorithm whereby no hidden layer of neurons is built.
While logistic regression can be used for many tasks, it is especially suited to estimation problems for which linear regression would otherwise be a good fit but the predicted value is discrete. In such cases the linear approach tends to predict values outside the allowed range, for example, predicting probabilities over 100 percent for a certain combination of inputs. Because it is derived from the neural network algorithm, logistic regression shares the same viewer.

Naive Bayes

Naive Bayes is a very fast algorithm with accuracy that is adequate for many applications. It does not, however, operate on continuous variables. The Naive portion of its name derives from this algorithm's assumption that every input is independent. For example, the probability of a married person purchasing a bike is computed from how often married and bike buyer appear together in the training data, without considering any other columns. The probability of a new case is just the normalized product of the individual probabilities. Several viewers display data from the finished model:

■ Dependency Network: Displays both input and predictable columns as nodes, with arrows indicating what predicts what; a simple example is shown in Figure 76-7. Move the slider to the bottom to see only the most significant predictions. Click on a node to highlight its relationships.

■ Attribute Profiles: Similar in function to the Cluster Profiles Viewer, this shows all variables and predictable outcomes in a single matrix. Each cell of the matrix is a graphical representation of that variable's distribution for a given outcome. Click on a cell (chart) to see the full distribution for that outcome/variable combination in the Mining Legend, or hover over a cell for the same information in a tooltip.

■ Attribute Characteristics: This viewer displays the list of characteristics associated with the selected outcome.

■ Attribute Discrimination: This viewer is similar to the Characteristics Viewer, but it shows which characteristics favor one outcome versus another.

Association rules

This algorithm operates by finding attributes that appear together in cases with sufficient frequency to be significant. These attribute groupings are called itemsets, which are in turn used to build the rules used to generate predictions. While Association Rules can be used for many tasks, it is especially suited to market basket analysis. Generally, data will be prepared for market basket analysis using a nested table, whereby the parent row is a transaction (e.g., Order) and the nested table contains the individual items. Three viewers provide insight into a trained model:

■ Rules: Similar in layout and controls to the Itemsets Viewer, but lists rules instead of itemsets. Each rule has the form A, B → C, meaning that cases that contain A and B are likely to contain C (e.g., people who bought pasta and sauce also bought cheese). Each rule is listed with its probability (likelihood of occurrence) and importance (usefulness in performing predictions).

■ Itemsets: Displays the list of itemsets discovered in the training data, each with its associated size (number of items in the set) and support (number of training cases in which this set appears). Several controls for filtering the list are provided, including the Filter Itemset text box, which searches for any string entered (e.g., "Region = Europe" will display only itemsets that include that string).
■ Dependency Network: Similar to the Dependency Network used for other algorithms, with nodes representing items in the market basket analysis. Note that nodes have a tendency to predict each other (dual-headed arrows). The slider will hide the less probable (not the less important) associations. Select a node to highlight its related nodes.

FIGURE 76-7  Naive Bayes Dependency Network Viewer

Time series

The time series algorithm predicts the future values for a series of continuous data points (e.g., web traffic for the next six months given traffic history). Unlike the algorithms already presented, prediction does not require new cases on which to base the prediction, just the number of steps to extend the series into the future. Input data must contain a time key to provide the algorithm's time attribute. Time keys can be defined using date, double, or long columns.

Once the algorithm has run, it generates a decision tree for each series being forecast. The decision tree defines one or more regions in the forecast and an equation for each region, which can be reviewed using the Decision Tree Viewer. For example, a node may be labeled Widget.Sales-4 < 10,000, which is interpreted as "use the equation in this node when widget sales from four time-steps back is less than 10,000." Selecting a node will display two associated equations in the Mining Legend, and hovering over the node will display the equation as a tooltip. SQL Server 2008 added the second equation, providing better long-term forecasts by blending these different estimation techniques. Note the Tree pull-down at the top of the viewer that enables the models for different series to be examined. Each node also displays a diamond chart whose width denotes the variance of the predicted attribute at that node. In other words, the narrower the diamond chart, the more accurate the prediction.

The second Time Series Viewer, labeled simply Charts, plots the actual and predicted values of the selected series over time. Choose the series to be plotted from the drop-down list in the upper-right corner of the chart. Use the Abs button to toggle between absolute (series) units and relative (percent change) values. The Show Deviations check box will add error bars to display expected variations on the predicted values, and the Prediction Steps control sets the number of predictions displayed. Drag the mouse to highlight the horizontal portion of interest and then click within the highlighted area to zoom into that region. Undo a zoom with the zoom controls on the toolbar.

Because prediction is not case based, the Mining Accuracy Chart does not function for this algorithm. Instead, keep later periods out of the training data and compare predicted values against the test data's actuals.

Cube Integration

Data mining can use Analysis Services cube data as input instead of using a relational table (see the first page of the Data Mining Wizard section earlier in this chapter); cube data behaves much the same as relational tables, with some important differences:

■ Whereas a relational table can be included from most any data source, the cube and the mining structure that references it must be defined within the same project.

■ The case "table" is defined by a single dimension and its related measure groups. When additional data mining attributes are needed, add them via a nested table.
■ When selecting mining structure keys for a relational table, the usual choice is the primary key of the table. Choose mining structure keys from dimension data at the highest (least granular) level possible. For example, generating a quarterly forecast requires that quarter be chosen as the key time attribute, not the time dimension's key (which is likely day or hour).

■ Data and content type defaults tend to be less reliable for cube data, so review and adjust type properties as needed.

■ Some dimension attributes based on numeric or date data may appear to the data mining interface with a text data type. A little background is required to understand why this happens: When a dimension is built, it is required to have the Key column property specified.