Nielsen c75.tex V4 - 07/21/2009 4:22pm Page 1582 Part X Business Intelligence

Data tables

Data retrieved into a table takes on all the capabilities of an Excel table:

■ Table formatting and totals: Click inside the table, then choose a style from the Design tab, and the entire table’s formatting will change to that style. The Total Row check box here enables a row at the bottom of the table for summary functions (e.g., SUM, AVERAGE, COUNT) that will apply to all extracted data, regardless of how many rows are returned at the next refresh.
■ Conditional formatting: Select a column in the table, choose a format from the Conditional Formatting menu on the Home tab, and the color, data bars, or icons will overlay the table data to highlight variations in values.
■ Filter and sort: Clicking on the column header menu enables visible rows to be filtered by picking individual values in that column, by defining conditions (e.g., greater or less than value or average, top 10, etc.), or based on conditional formatting applied to that column. Similarly, the column can be sorted by either value or conditional formatting.
■ Add/Remove columns: Insert a new column into the table and enter an Excel formula into any cell within that column to create a calculated column. Additionally, entire columns can be eliminated from the table by deleting that column without the need to change the connection definition. Similarly, you can also remove rows from the table. However, these rows will reappear the next time the table is refreshed.

The latest data from the database can be retrieved at any time by right-clicking on the table and choosing the Refresh item, or by choosing one of the Refresh options from the Data tab. None of the changes made to the Excel table will change data in the source database.

PivotTables

PivotTables and PivotCharts are powerful analysis tools that work for both relational and Analysis Services data.
The way Excel interacts with the source data is fundamentally different between these two types of data, however. For relational data sources, Excel reads the entire data set from the database as soon as the PivotTable is created, storing it invisibly within the workbook in a PivotCache object. This enables the PivotTable to respond to changes without querying the underlying data each time, but it can make for a very large workbook when the data set is large. By contrast, Analysis Services data sources are queried for each update to the PivotTable, keeping the workbook size down and relying on the responsiveness of Analysis Services. PivotTables created on Analysis Services data sources reflect the latest data with every change to the PivotTable, whereas relationally based PivotTables only reflect new data when explicitly refreshed (or refreshed by the connection definition).

Start a PivotTable by either choosing a connection from the Data tab or choosing PivotTable from the Insert tab. The idea of pivoting data is to display summaries based on categories that are placed as row and column headers. As categories are dropped onto the header areas, the table quickly reformats itself to display values grouped by all the currently selected category values, as shown in Figure 75-1.
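To make the idea of pivoting concrete outside of Excel, the following is a minimal Python sketch that groups rows by one category for row headers and another for column headers, summing a value for each pair. It is only an illustration of the concept, not a description of how Excel or the PivotCache works internally; the sample rows, the `ORDERS` list, and the `pivot` function are hypothetical names invented for this example.

```python
from collections import defaultdict

# Hypothetical order rows: (calendar_year, state_province, order_count).
ORDERS = [
    (2007, "Oregon", 3),
    (2007, "Washington", 5),
    (2008, "Oregon", 4),
    (2008, "Washington", 6),
    (2008, "Oregon", 2),
]


def pivot(rows, row_key, col_key, value_key):
    """Sum the value for every (row label, column label) pair."""
    table = defaultdict(lambda: defaultdict(int))
    for record in rows:
        table[record[row_key]][record[col_key]] += record[value_key]
    # Convert to plain dicts so the result prints and compares cleanly.
    return {label: dict(cols) for label, cols in table.items()}


# Calendar year as row labels, state/province as column labels,
# order count as the aggregated value.
summary = pivot(ORDERS, row_key=0, col_key=1, value_key=2)
for year in sorted(summary):
    print(year, summary[year])
```

Swapping `row_key` and `col_key` corresponds to dragging a field from the row area to the column area: the same data is regrouped without changing the source rows.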
FIGURE 75-1 Excel PivotTable based on Analysis Services cube

Once the PivotTable is added to a worksheet, available data fields are displayed in the PivotTable Field List, ready for dragging onto one of the four table areas:

■ Values: The center of the table that displays data aggregates, such as the Internet Order Count shown in Figure 75-1
■ Row Labels: Category data that provides row headers on the left side of the table (e.g., Calendar Year in Figure 75-1)
■ Column Labels: Category data that provides column headers along the top of the table (e.g., State-Province in Figure 75-1)
■ Report Filter: Provides an overall filter for the PivotTable that does not change the layout of the table (e.g., Country in Figure 75-1)

While the Field List panel is basically the same for both relational and Analysis Services data sources, the Analysis Services version includes additional information. Values (called measures in Analysis Services) are differentiated from category items by the symbol. In addition, the “Show fields related to” filter at the top of the panel restricts the Field List to only those items in the selected group of values (called measure groups in Analysis Services). Category items within the Field List can be organized into folders by setting an item’s AttributeHierarchyDisplayFolder property in Analysis Services, which causes the folders to appear next to Contacts and other category groups in Figure 75-1. Finally, Analysis Services defines hierarchies that allow drill-down paths, such as Calendar and Customer Geography in Figure 75-1, which enable details to be toggled in outline form.

Once fields have been placed in the PivotTable, field-specific settings are available.
Right-click on a field in the PivotTable to access the following:

■ Field settings: These provide control over subtotals, layout, number format, and how values are calculated. Calculation options include basic aggregation functions (SUM, COUNT, AVERAGE, etc.), as well as “% of,” “Running total,” and several other options.
■ Sort settings: Choose to sort rows or columns based on either headers or values.
■ Filter: Individual header values can be selected, Label filters can be defined (e.g., State-Province does not contain “Wales”), or Value filters can be defined (e.g., show only periods with more than 100 orders).
■ Properties: Analysis Services data sources associate properties with many of the values listed in the header. Some of these values may not be available directly in the Field List. Properties may be exposed either directly as columns in the spreadsheet or as a tooltip when the cursor hovers over a header.
■ Additional Actions: Analysis Services can associate actions, such as running reports, with header values.

Because PivotTables display summary data, it is often useful to drill into the details behind a sum or count. Double-clicking on any value will create a new worksheet with the associated detail rows. By default, Analysis Services data sources limit the rows returned by a drill-through to 1,000, but this maximum is configurable via the Connection Properties dialog.

After a bit of practice, generating a desired view in this environment is extremely time efficient, limited mostly by the speed of the underlying data source. Insights into data can be gained at a surprising rate.

PivotCharts

PivotCharts (see Figure 75-2 for an example) are bound to a PivotTable, displaying the contents of the table as it changes. The PivotTable’s row headers appear as axis labels in the chart, and its column headers appear as entries in the legend.
You can create a PivotChart either by choosing the PivotChart option when the PivotTable is created or by clicking inside of an existing PivotTable and inserting an Excel chart. You can control the content of the PivotChart with either the full-featured PivotTable Field List or the simplified PivotChart Filter pane. The majority of Excel chart functions are available for a PivotChart, including creating a full-page chart by right-clicking and choosing the Move option.

FIGURE 75-2 Excel PivotChart based on Analysis Services cube

Advanced Data Analysis

The SQL Server Data Mining Add-ins for Office 2007 make a number of additional features available in Excel for analyzing data. This free download enhances Excel with features that make it easier to explore and prepare data sets, perform common analyses using data mining, and allow Excel to act as a full data mining client.

This approach of encapsulating common data mining analyses in Excel is extremely powerful, allowing a much wider audience to use data mining than would otherwise have access to it. Note that most of these features require an Analysis Services server to execute the associated data mining processing. See Chapter 76, “Data Mining with Analysis Services,” for more detail on how to approach data mining projects and available algorithms.

Installing the data mining add-ins

Start by downloading and installing the add-ins. The product page for the add-ins, www.microsoft.com/sqlserver/2008/en/us/data-mining-addins.aspx, includes pointers to the download, tutorials, webcasts, and labs.

Because executing the data mining algorithms requires access to an Analysis Services server, setup installs and provides a link to run the Server Configuration Utility.
The configuration wizard will set up a new Analysis Services database in which Excel mining models can be created, or it enables you to identify an existing database if one has already been created for that purpose. This process assumes that an Analysis Services server is available and the account used for installation has adequate permissions. The configuration utility will also suggest enabling the creation of temporary mining models, which is important to prevent the database from filling with junk objects as a result of the models Excel will create.

Once the install and configuration steps are complete, Excel’s Ribbon will have two new tabs: Data Mining and Analyze (select some portion of a table to see the Analyze tab, described in “Table Analysis Tools” later in this chapter).

Exploring and preparing data

Using these advanced functions is easiest when the data set being analyzed is defined as an Excel table. Data imported from external sources is automatically defined as a table, but other data, such as that entered into Excel via a copy/paste operation, will not automatically be defined as a table. A simple way to check whether a data set has been defined as a table is to select a cell in it: if it has been defined as a table, the Table Tools group of tabs will appear in Excel’s Ribbon. Convert a range of cells into a table by first ensuring that the top row of cells contains column headers for the table, selecting a cell in the range to be converted, and then choosing Table from the Insert tab.

Excel assigns table names that may be less than intuitive. Table names can be adjusted by selecting a cell in a table, choosing the “Table Tools” Design tab, and typing over the name that appears on the left-hand side of the Ribbon.

Once the data has been organized as desired, there are three actions in the Data Preparation group of the Data Mining tab described in this section.
While these functions are intended to prepare data for use by the data mining client, they can be useful for a wide variety of situations. None of these exploration and preparation functions rely on data mining algorithms, nor do they communicate with the Analysis Services server.

Explore Data

Choose Explore Data and the wizard will prompt for a table and column name, and then display a histogram of rows for each value in that column. For example, Figure 75-3(a) shows the count of rows for each value in the NumberChildrenAtHome column. For numeric data, an alternate display can be toggled via the icons at the lower left, allowing the data to be grouped into equally sized buckets of values, as shown in Figure 75-3(b). This is very useful for columns that contain a large number of values, such as dates, salaries, and so on. Displays in numeric mode can also add a new column to the source table to denote into which bucket each row falls.

The copy button will snapshot the histogram chart for pasting in any application that accepts bitmap graphics.

FIGURE 75-3 Explore Data histograms in (a) Discrete and (b) Numeric displays

Clean Data

Choose Clean Data and two options will appear: Outliers and Re-label. Outliers is very similar to Explore Data as described above, except that when the histogram displays, sliders appear that allow the elimination of extreme data values in the table. For numeric values, this includes identifying minimum and maximum allowable values, with several handling options: replacing an outlier with limit values, replacing an outlier with a mean value, simply clearing the outlier, or totally removing the offending row.

For text values, infrequently used values can be defined as outliers.
This enables, for example, the top 10 occurring cities to be surfaced in an analysis, with less frequently occurring cities grouped under an Other category. In addition to replacing values, text values can be cleared or the associated rows removed from the table.

The Re-label variant can be thought of as a structured search and replace. After identifying the table and column of interest, the wizard presents a list of current values in that column, prompting for the new values with which they should be replaced. This function is useful for fixing data entry problems, mapping abbreviations to reporting descriptions, or even grouping data into categories.

Partition Data

Choose this function to copy rows from a source table to new tables in useful ways:

■ Split data into training and testing sets: When building data mining models, it is necessary not only to train a model using part of the available data, but also to reserve a part of that data for testing the trained model to assess how well it will perform on data it has not yet seen. This option will split the source table into two separate tables for this purpose based on a chosen ratio, randomly selecting which rows fall into each set.
■ Random sampling: This option extracts a random sample of the rows based on a supplied ratio or row count. While very similar in function to the Split option, it more directly addresses the need to select a small sample from a large one, or to generate multiple training sets to assess the differences that result from training a model on different data slices.
■ Oversampling to balance data distribution: Data sets sometimes do not accurately represent the populations they are meant to model. Oversampling is a method to compensate for sampling bias in a data set.
Indicate to the wizard the column and associated value to sample, and the resulting new data set will guarantee a representation of rows with a specified ratio.

Table analysis tools

Select a cell inside of a table and the Table Tools tabs will become available, including the Analyze tab. The functions on this tab are common data mining operations that have been made nearly single-click operations. All of these operations use Analysis Services to run the associated data mining algorithms. The server and database used can be changed by choosing Connection from the Ribbon.

Analyze Key Influencers

Data sets that include predictable outcome(s) often have many attributes, not all of which are important in determining the outcome. Select a cell in the table, choose the Analyze Key Influencers option, tell the wizard which column contains the outcome, and Excel will build a Naïve Bayes model to determine which attributes (columns) are most influential in determining the outcome. Excel will automatically add a worksheet and report on key influencers. Additional report sections can be generated to contrast influencers for selected outcomes.

The resulting report provides some initial insight into the data set being analyzed, and suggests attributes that should definitely be included when developing a predictive analysis. However, it is important to understand that these are often not the only attributes that influence the outcome. Naïve Bayes is the simplest of algorithms and will only detect very direct relationships.

Detect Categories

It is often useful to group cases (rows) in a data set into groups to better understand the population. For example, grouping customers by common traits could yield insights that lead to more targeted marketing campaigns.

Tell the wizard which columns to consider in determining the categories, limit the number of categories that will be created if desired, and click Run.
Excel will build a clustering model to put similar cases into distinct buckets, add the category names as a new column to the source table, and then add a worksheet that enables the exploration and naming of the associated categories.

The Categories Report page contains three sections, including notes about how to use each. The top-level summary shows how many cases fall into each category and allows the categories to be renamed. The second section shows the characteristics of the selected category; change the filter on the category column to display other categories. The third section shows how a selected column varies across all categories; change the column displayed by right-clicking on the x-axis and choosing the “Sort and Filter” menu item.

Highlight Exceptions

The wizard and algorithm for this analysis are identical to those of Detect Categories described above, but instead of presenting a report that enables exploration of the categories, cases that don’t fall inside the categories are identified. This extends the idea of outlier detection to the next level, looking not just at the range of a single column, but at how that column’s value fits with other attributes in that row. The result finds combinations that, while not impossible, are unlikely, such as managers with entry-level salaries. Basic outlier detection would not recognize this problem because the salary is in a valid range for the data set as a whole.

Excel builds the categories and then looks at every table row in turn, using the model to predict the likelihood of that row given the category definition. When the likelihood falls below the user-defined threshold, that row in the table is highlighted. In addition, the value in each column is evaluated for its likelihood as well, and the least likely value is highlighted.
Excel automatically adds a report worksheet that summarizes the exceptions found by the least-likely column. The report also contains the threshold that determines which likelihoods are considered exceptions; adjust this value to see fewer or more exceptions.

When reviewing exceptions in a large table, it is helpful to sort by color to put all the exceptions in one place: Right-click on an exception, and select Sort ➪ Put selected cell color on top.

Fill from Example

Excel will build a logistic regression model to detect patterns and estimate data for a column with missing data. Tell the wizard the column to be filled in and the columns to be used to detect patterns, and Excel will add a new column to the table with all the values filled. In addition, a report is added summarizing the patterns used to determine the missing values.

This feature is meant to handle a variety of missing data cases, such as surveys that are missing some responses, assuming that patterns in the known attributes will be a good predictor of the missing attribute. The model that Excel builds will always supply a value for the missing attribute, even when that value is not correct, so validate the model before accepting the values it provides. For example, take some test rows (cases with known values for the attribute in question), add them to the table with that attribute left blank, and compare the values generated by the model to the actual values.

Forecasting

Forecasting estimates the next steps of a series given its history. For example, what will next quarter’s sales be? Set up a table with all the related series in columns, with one time column. It is best to include related series, as the Time Series algorithm finds relationships between series that can help build better forecasts. For example, last year’s software sales numbers may help predict this year’s maintenance sales.
Indicate in the wizard the series to be predicted, which column contains time, and the number of periods to be predicted. If the data has an inherent periodicity, such as a quarterly sales cycle, supplying that information in the wizard as a hint to the algorithm may improve the forecast. Excel will extend the source table with new rows at the bottom containing predicted values. In addition, a forecasting report worksheet is added showing a graph with the existing and predicted values.

A good test for the reliability of the forecast is to copy the source worksheet, remove the last few periods, and run the forecast to predict known values. Comparing the actual and predicted values will give you an indication of the reliability of the forecast going forward.

Scenario Analysis

Scenario Analysis investigates how changes to the source data set affect the outcome. The “What-if” option enables the user to form an exact question, for example, “How many more customers would purchase a bike if their income went up 20%?” The “Goal Seek” option asks Excel to find the value at which the desired outcome occurs, for example, “How much more income would our customers need before purchasing a bike?”

Excel builds a logistic regression model to estimate the impact of changes. Upon completion of a “What-if” scenario for the entire table, new columns are added to the source table showing the new value of the outcome column and the confidence in the result. “Goal Seek” for the entire table adds columns for the new value of the outcome column and the new value for the column being adjusted.
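The difference between the two Scenario Analysis questions can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the add-in’s implementation: the coefficients `INTERCEPT` and `INCOME_WEIGHT` are hypothetical stand-ins for a model that Analysis Services would actually train, the model uses income as its only input, and the bisection search is just one plausible way to answer a goal-seek question.

```python
import math

# Hypothetical coefficients for a logistic model of bike purchase.
# A real model would be trained from the table, not hard-coded.
INTERCEPT = -4.0
INCOME_WEIGHT = 0.00005  # per dollar of yearly income


def purchase_probability(income):
    """Logistic function: estimated probability that a customer buys."""
    return 1.0 / (1.0 + math.exp(-(INTERCEPT + INCOME_WEIGHT * income)))


def what_if(income, change_pct):
    """Return (current, scenario) probabilities for a percentage change."""
    scenario_income = income * (1 + change_pct / 100)
    return purchase_probability(income), purchase_probability(scenario_income)


def goal_seek(target, low=0.0, high=1_000_000.0, tolerance=1.0):
    """Bisect for the income at which the probability reaches target."""
    if not purchase_probability(low) <= target <= purchase_probability(high):
        raise ValueError("target is outside the searchable income range")
    while high - low > tolerance:
        mid = (low + high) / 2
        if purchase_probability(mid) < target:
            low = mid
        else:
            high = mid
    return high


# "What-if": how does the outcome change if income rises 20%?
before, after = what_if(60_000, 20)
# "Goal Seek": what income is needed for a 50% purchase probability?
needed = goal_seek(0.5)
print(round(before, 3), round(after, 3), round(needed))
```

The what-if call changes an input and reads off the new outcome, whereas goal seek fixes the desired outcome and searches for the input that produces it.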
Prediction Calculator

The Prediction Calculator functions by first using the data in an Excel table to train a logistic regression model to predict an outcome, and then makes the resulting model available as a calculator in Excel to evaluate individual cases without being connected to the Analysis Services server. For example, a model could be trained to predict component failure based on measurable attributes, and then made available to technicians performing preventive maintenance.

Tell the wizard which attribute and attribute value are to be predicted, and it will create up to three new sheets in the workbook:

■ Prediction Report: Lists all the significant attribute/value combinations found in building the model and their impact on the result. In addition, if the user enters costs associated with correct and incorrect guesses into the interactive profit calculator (e.g., cost of a component failure vs. cost of replacing a component that would not have failed), a threshold will be calculated for how likely an outcome must be before it is predicted. This threshold is then used in the calculator pages.
■ Prediction Calculator (optional): Enter the values for a case and see the predicted outcome.
■ Printable Calculator (optional): This contains a printable form that can be used for data collection and later entry into Excel, or even manual calculation without entry into Excel.

Shopping Basket Analysis

The Shopping Basket Analysis is a quick way to build an association rules model based on the data in an Excel table.
This model will identify groups of items that normally appear together in a transaction, allowing better product organization and/or suggestions to customers, for example, the famous “Customers who bought this book also bought ...”

The Excel table must contain certain columns that are indicated in the wizard:

■ Transaction ID: The Order Number, Session ID, or some other identifier that ties multiple rows together into a single transaction
■ Item: The name or other identifier of the item purchased
■ Item Value (optional): The price or value of the item included in that transaction. This enables the results to be sorted on the total value that a “basket” represents (average price of the basket * number of sales). As a result, a priority can be placed on suggestions that will likely yield greater revenue.

After the model has been built, two sheets are added to the workbook. The Bundled Items report details all the bundles (item combinations) found and their associated sales and price information. The Recommendations report lists recommendation rules by item, the proposed recommendation, and supporting statistics.

Data mining client

The Data Mining tab added by installing the SQL Server Data Mining Add-ins for Office 2007 provides a full data mining environment, equivalent to the data mining environment provided by Visual Studio (also known as Business Intelligence Development Studio). Unlike the table analysis tools described earlier, whereby tables and reports are created directly in Excel, the primary focus here is on creating, training, browsing, and querying data mining models in an Analysis Services database.
Working from within Excel to develop models can have advantages over the Visual Studio environment, especially when working with small amounts of data early in the process, when cleaning and exploring the data set, as the data set can be quickly changed in Excel and used to train and test models in Analysis Services. However, there are limitations to the Excel environment, such as the inability to show the accuracy of competing models in the same accuracy chart.

See Chapter 76, “Data Mining with Analysis Services,” to learn more about the data mining features of Analysis Services and the functions detailed here.

Functions exposed on the Data Mining tab include the following:

■ Data Preparation: This is described in the section “Exploring and Preparing Data” earlier in the chapter.
■ Data Modeling: Allows the creation of mining structures and models. Several of the most popular models are listed as separate functions, while the Advanced option provides access to all available algorithms.
■ Accuracy and Validation: Provides different views of model performance on test data
■ Browse: Enables the examination of model details for any model in the current database
■ Document Model: Adds a new sheet to the current workbook listing model details
■ Query: Provides a friendly environment for constructing and executing DMX queries against mining models
■ Manage Models: Enables structures and models in the current database to be deleted, renamed, processed, and so on
■ Connection: Manages connections to the Analysis Services database
■ Trace: Provides a history of every command sent to the Analysis Services server. Use of session models for table analysis functions can also be enabled or disabled here.

Summary

Microsoft Excel has long been the most frequently used tool for analyzing data, and with the advent of the 2007 version, it is easier than ever to include relational and Analysis Services data in those analyses.
Relational data can be included in data tables that remain linked to the underlying table or query