CHAPTER 51 SQL Server 2008 Analysis Services
An OLAP Requirements Example: CompSales International

FIGURE 51.57 Creating KPIs in the cube designer.

previously known. As you create dimensions, you can even choose a data mining model as the basis for a dimension.

A data mining model is a reference structure that represents the grouping and predictive analysis of relational or multidimensional data. It is composed of rules, patterns, and other statistical information about the data it analyzed. These are called cases. A case set is simply a means for viewing the physical data, and different case sets can be constructed from the same physical data. In other words, a case is defined from a particular point of view. If the algorithm you are using supports the view, you can use mining models to make predictions based on these findings.

Another aspect of a data mining model is its use of training data. Training determines the relative importance of each attribute in a data mining model. It does this by recursively partitioning data into smaller groups until no more splitting can occur. During this partitioning process, information is gathered from the attributes used to determine each split, and a probability can be established for each categorization of data in these splits. These probabilities can then be used to help determine factors about other data. The training data, in the form of dimensions, levels, member properties, and measures, is used to process the OLAP data mining model and further define the data mining column structure for the case set.

In SSAS, Microsoft provides several data mining algorithms (or techniques):

- Association Rules: This algorithm builds rules that describe which items are most likely to appear together in a transaction. The rules help predict when one item is likely to be present given another item that has appeared with it in the same type of transaction before.
- Clustering: This algorithm uses iterative techniques to group records from a dataset into clusters that contain similar characteristics. It is a good general-purpose choice, and it can be used to find general groupings in data.
- Sequence Clustering: This algorithm is a combination of sequence analysis and clustering, and it identifies clusters of similarly ordered events in a sequence. The clusters can be used to predict the likely ordering of events in a sequence, based on known characteristics.
- Decision Trees: This classification algorithm works well for predictive modeling. It supports the prediction of both discrete and continuous attributes.
- Linear Regression: This regression algorithm works well for regression modeling. It is a configuration variation of the Decision Trees algorithm, obtained by disabling splits. (The whole regression formula is built in a single root node.) The algorithm supports the prediction of continuous attributes.
- Logistic Regression: This regression algorithm works well for regression modeling. It is a configuration variation of the Neural Network algorithm, obtained by eliminating the hidden layer. This algorithm supports the prediction of both discrete and continuous attributes.
- Naïve Bayes: This classification algorithm is quick to build, and it works well for predictive modeling. It supports only discrete attributes, and it considers all the input attributes to be independent, given the predictable attribute.
- Neural Network: This algorithm uses a gradient method to optimize parameters of multilayer networks to predict multiple attributes. It can be used for classification of discrete attributes as well as regression of continuous attributes.
- Time Series: This algorithm uses a linear regression decision tree approach to analyze time-related data, such as monthly sales data or yearly profits.
  The patterns it discovers can be used to predict values for future time steps across a time horizon.

To create an OLAP data mining model, SSAS uses either an existing source OLAP cube or an existing relational database/data warehouse, a particular data mining technique/algorithm, a case dimension and level, a predicted entity, and, optionally, training data. The source OLAP cube provides the information needed to create a case set for the data mining model. You then select the data mining technique (decision tree, clustering, or one of the others). SSAS uses the dimension and level that you choose to establish key columns for the case sets. The case dimension and level give the data mining model a certain orientation into the cube for creating a case set. The predicted entity can be a measure from the source OLAP cube, a member property of the case dimension and level, or any member of another dimension in the source OLAP cube.

NOTE
The Data Mining Wizard can also create a new dimension for a source cube, and it enables users to query the data mining model data just as they would query OLAP data (using DMX, the SQL-like Data Mining Extensions language, or the mining structures browser).

In Visual Studio, you initiate the Data Mining Wizard by right-clicking the Mining Structures entry in the Solution Explorer. You cannot create new mining structures from SSMS. When you are past the wizard’s splash screen, you have the option of creating your mining model from either an existing relational database (or data warehouse) or an existing OLAP cube (as shown in Figure 51.58).

You want to define a data mining model that can shed light on product (SKU) sales characteristics and that will be based on the data and structure you have created so far in your Comp Sales Unleashed cube. For this example, you choose the existing cube method and use the OLAP cube you already have.
FIGURE 51.58 Selecting the definition method to use for the mining structure in the Data Mining Wizard.

You must now select the data mining technique you think will help you find value in your cube’s data. Clustering is probably the best one to start from because it finds natural groupings of data in a multidimensional space. It is useful when you want to see general groupings in your data, such as hot spots. You are trying to find just such things with sales of products (for example, things that sell together or belong together). Figure 51.59 shows the data mining technique Microsoft Clustering being selected.

FIGURE 51.59 Using clustering to identify natural groups in the Data Mining Wizard.

Now you have to identify the source cube dimension to use to build the mining structure. As you can see in Figure 51.60, you choose Product Dimension to fit the mining intentions stated earlier.

FIGURE 51.60 Identifying the product dimension as the basis for the mining structure in the Data Mining Wizard.

You then select the case key, or point of view, for the mining analysis. Figure 51.61 illustrates the case to be based on the product dimension and at the SKU level (that is, the individual product level).

You now specify the attributes and measures as case-level columns of the new mining structure. Figure 51.62 shows the possible selections. You can simply choose all the data measures for this mining structure. Then you click the Next button.
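As a rough illustration of what a case set with measures as case-level columns looks like, the following Python sketch groups a few made-up fact rows by SKU and, for contrast, by store. It is not how SSAS builds case sets internally. The row layout, the column names (`sku`, `store`, `units`, `amount`), and the helper `build_case_set` are all hypothetical and exist only to show that the same physical data yields different case sets depending on the chosen case key.

```python
from collections import defaultdict

# Hypothetical fact rows; in the chapter's example this data lives in the cube.
FACT_ROWS = [
    {"sku": "A100", "store": "S1", "units": 5, "amount": 50.0},
    {"sku": "A100", "store": "S2", "units": 3, "amount": 30.0},
    {"sku": "B200", "store": "S1", "units": 2, "amount": 80.0},
    {"sku": "B200", "store": "S1", "units": 1, "amount": 40.0},
]


def build_case_set(rows, case_key, measures):
    """Return {case key value: {measure: total}} for one point of view."""
    cases = defaultdict(lambda: dict.fromkeys(measures, 0))
    for row in rows:
        case = cases[row[case_key]]  # one case per distinct key value
        for measure in measures:
            case[measure] += row[measure]  # measures become case-level columns
    return dict(cases)


# Same physical data, two different case sets.
sku_cases = build_case_set(FACT_ROWS, "sku", ["units", "amount"])
store_cases = build_case_set(FACT_ROWS, "store", ["units", "amount"])

print(sku_cases)
print(store_cases)
```

Keying on `sku` mirrors the SKU-level case chosen in the wizard; keying on `store` shows how a different point of view regroups the same rows.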
As you can see in Figure 51.63, the next few wizard dialogs allow you to specify the content and data types of the mining structure columns (use the defaults that were detected for most items unless we specifically describe something different), identify a filtered slice to use for the model training (you don’t need to use this now because you want the whole cube), and finally identify the number of cases to be reserved for model testing (reserve about 33% of the data for testing).

FIGURE 51.61 Identifying the basic unit of analysis for the mining model in the Data Mining Wizard.

FIGURE 51.62 Specifying the measure for the mining model in the Data Mining Wizard.

FIGURE 51.63 Specifying a column’s content, slice filters, and model data training percentages.

The mining model is now specified and must be named and processed. Figure 51.64 shows what you have named the mining structure (Product Dimension MS) and the mining model itself (Product Dimension MM). Also, you select the Allow Drill Through option so you can look further into the data in the mining model after it is processed. Then you click the Finish button.

When the Data Mining Wizard is complete, the mining structure viewer pops up, with the specifications of your mining structure’s case-level columns (on the center left) and their correlation to your cube (see Figure 51.65).

You must now process the mining structure to see what you come up with. You do this by selecting the Mining Model toolbar option and then selecting the Process option. You then see the usual Process dialog, and you have to choose to run it (that is, process the mining structure). After the mining structure processing completes, a quick click on the Cluster Diagram tab shows the results of the clustering analysis (see Figure 51.66).
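To get a feel for what a clustering pass does with SKU-level cases, here is a minimal Python sketch. It uses plain k-means as a stand-in and makes no claim to reproduce the Microsoft Clustering algorithm or its parameters. The SKU names, the measure values, and the `kmeans` helper are hypothetical, and the sketch assumes `k` is no larger than the number of cases.

```python
import math

# Hypothetical SKU-level cases: (units sold, sales amount).
SKU_CASES = {
    "A100": (8, 80.0),
    "A110": (9, 95.0),
    "A120": (7, 70.0),
    "B200": (3, 1200.0),
    "B210": (2, 1100.0),
    "B220": (4, 1300.0),
}


def kmeans(points, k, iterations=20):
    """Plain k-means: return a cluster index for each point."""
    # Deterministic start: seed centroids with points spread through the input.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


# Note: the measures are left unscaled, so the larger amounts dominate distance.
labels = kmeans(list(SKU_CASES.values()), k=2)

# Group SKUs by cluster to list the members of each group.
clusters = {}
for sku, label in zip(SKU_CASES, labels):
    clusters.setdefault(label, []).append(sku)

print(clusters)
```

Listing the members of each group is loosely analogous to inspecting the contents of a cluster in the viewer.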
Because you selected to allow drill through, you can right-click any of the identified clusters and choose Drill Through to see the data that is part of the cluster. This viewer clearly shows that there is some clustering of SKU values that might indicate products that sell together or belong together.

FIGURE 51.64 Naming the mining model and completing the Data Mining Wizard.

FIGURE 51.65 Your new mining structure in the mining structure viewer.

If you click the Cluster Profiles tab of this viewer, you see the data value profile characteristics that were processed (see Figure 51.67). Figure 51.68 shows the clusters of data values of each data measure in the data mining model. This characteristic information gives you a good idea of what the actual data values are and how they cluster together.

FIGURE 51.66 Clustering results and drilling through to the data in the mining model viewer.

FIGURE 51.67 Cluster data profiles in the mining model viewer.

Finally, you can see the cluster node contents at the detail level by changing the mining model viewer type to Microsoft Generic Content Tree Viewer, which is just below the Mining Model Viewer tab at the top. Figure 51.69 shows the detail contents of each model node and its technical specification in a report format.

If you want, you can now build new cube dimensions that can help you do predictive modeling based on the findings of the data mining structures you just processed. In this way, you could predict sales units of one SKU and the number of naturally clustered SKUs quite easily (based on the past data mining analysis). This type of predictive modeling is very powerful.

FIGURE 51.68 Cluster characteristics of the data values for each measure in the mining model viewer.
FIGURE 51.69 The Microsoft Generic Content Tree Viewer of the cluster nodes in the mining model viewer.

SSIS

SSIS provides a robust means to move data between sources and targets. Data can be exported, validated, cleaned up, consolidated, transformed, and then imported into a destination of any kind. With any OLAP/SSAS implementation, you will undoubtedly have to transform, clean, or preprocess data in some way. You can now tap into SSIS capabilities from within the SSAS platform.

You can combine multiple column values into a single calculated destination column or divide column values from a single source column into multiple destination columns. You might need to translate values in operational systems. For example, many OLTP systems use product codes stored as numeric data. Few people are willing to memorize an entire collection of product codes. An entry of 100235 for a type of shampoo in a product dimension table is useless to a vice president of marketing who is interested in how much of that shampoo was sold in California in the past quarter.

Cleanup and validation of data are critical to the data’s value in the data warehouse. The old saying “garbage in, garbage out” applies. If data is missing, redundant, or inconsistent, high-level aggregations can be inaccurate, so you should at least know that these conditions exist. Perhaps data should be rejected for use in the warehouse until the source data can be reconciled. If the shampoo of interest to the vice president is called Shamp in one database and Shampoo in another, aggregations on either value would not produce complete information about the product.

The SSIS packages define the steps in a transformation workflow. You can execute the steps serially, in parallel, conditionally, or in combinations of these.
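SSIS packages are assembled in a designer rather than written as scripts, but the kinds of transformations described above can be sketched in plain Python to make them concrete. The sketch below is illustrative only and does not use any SSIS API. The lookup tables, the column names, the `transform` function, and the assumption that `location` always has the form "City, ST" are hypothetical; only the product code 100235 and the Shamp/Shampoo mismatch come from the text.

```python
# Hypothetical lookup tables; real ones would come from reference data.
PRODUCT_NAMES = {100235: "Shampoo"}  # numeric OLTP code -> readable name
NAME_FIXES = {"Shamp": "Shampoo"}    # reconcile inconsistent spellings

REQUIRED = ("product", "units", "unit_price", "location")


def transform(row):
    """Return a cleaned warehouse row, or None if the row should be rejected."""
    # Validate: reject rows with missing values instead of loading bad data.
    if any(row.get(col) is None for col in REQUIRED):
        return None
    # Translate a numeric product code, or reconcile a known spelling variant.
    product = row["product"]
    product = PRODUCT_NAMES.get(product, NAME_FIXES.get(product, product))
    # Divide one source column into two destination columns.
    city, state = (part.strip() for part in row["location"].split(","))
    # Combine two source columns into one calculated destination column.
    amount = row["units"] * row["unit_price"]
    return {"product": product, "city": city, "state": state, "amount": amount}


source_rows = [
    {"product": 100235, "units": 10, "unit_price": 4.5, "location": "Los Angeles, CA"},
    {"product": "Shamp", "units": 2, "unit_price": 4.5, "location": "San Diego, CA"},
    {"product": "Shampoo", "units": None, "unit_price": 4.5, "location": "Fresno, CA"},
]

cleaned = [row for row in map(transform, source_rows) if row is not None]
rejected = len(source_rows) - len(cleaned)

# With names reconciled, one aggregation covers every spelling of the product.
shampoo_in_ca = sum(
    row["amount"]
    for row in cleaned
    if row["product"] == "Shampoo" and row["state"] == "CA"
)
print(cleaned, rejected, shampoo_in_ca)
```

Returning `None` for an invalid row keeps rejection explicit, so the caller can count or log rejected rows instead of silently aggregating incomplete data.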
For more information on SSIS, refer to Chapter 46, “SQLCLR: Developing SQL Server Objects in .NET.”

OLAP Performance

Performance is a big emphasis of SSAS. Usage-based aggregation is at the heart of much of what you can do to help in this area. In addition, the proactive caching mechanism in SSAS has allowed much of what was previously a bottleneck (and a slowdown) to be circumvented.

When designing cubes for deployment, you should consider the data scope of all the data accesses (that is, all the OLAP queries that will ever touch the cube). You should build a cube that is only big enough to handle these known data scopes. If you don’t have requirements for something, you shouldn’t build it. This keeps cubes smaller and more manageable, which translates into faster overall performance for those who use the cube.

You can also take caching to the extreme by relocating the OLAP physical storage components on a solid-state disk device (that is, a persistent memory device). This can yield performance gains of as much as tenfold. The price of this type of technology has been dramatically