These concepts also directly support the Information Architecture Principle described in Chapter 2, making data "readily available in a usable format for daily operations and analysis." This chapter describes the key concepts and practices behind BI, enabling the availability and usability that maximize your data's value.

Data Warehousing

Data warehousing is a key concept behind both the structure and the discipline of a BI solution. While the increasingly powerful tools provided in the latest release of SQL Server ease some of the requirements around warehousing, it is still helpful to understand these concepts when considering your design.

Star schema

The data warehouse is the industry-standard approach to structuring a relational OLAP data store. It begins with the idea of dimensions and measures, whereby a dimension is a categorization or "group by" in the data, and the measure is the value being summarized. For example, in "net sales by quarter and division," the measure is "net sales," and the dimensions are "quarter" (time) and "division" (organization).

Deciding which dimensions and measures to include in the warehouse should be based on the needs of the business, bringing together an understanding of the types of questions that will be asked and the semantics of the data being warehoused. Interviews and details about existing reports and metrics can help gain a first approximation, but for most organizations a pilot project is needed to fully define requirements.

Best Practice

Organizations that are not familiar with what BI solutions can deliver have a difficult time understanding the power they represent. Kick-start your effort by developing a simple prototype that demonstrates some of what is possible using data that everyone understands. Choose a small but relevant subset of data and implement a few dimensions and one or two measures to keep implementation time to a minimum.

Business needs can then provide a basis for building the star schema that is the building block of the warehouse (see Figure 70-1). The star schema derives its name from its structure: a central fact table and a number of dimension tables clustered around it like the points of a star. Each dimension is connected back to the fact table by a foreign key relationship.

FIGURE 70-1: Simple star schema, with a central fact table surrounded by dimension tables such as dimDrug, dimDiagnosis, dimPatient, dimSite, dimStatus, dimDate, dimPhysician, dimTherapy, and dimStage.

The fact table consists of two types of columns: the keys that relate to each dimension in the star and the facts (or measures) of interest. Each dimension table consists of a primary key by which it relates back to the fact table, and one or more attributes that categorize data for that dimension. For example, a customer dimension may include attributes for name, e-mail address, and zip code. In general, the dimension represents a denormalization of the data in the OLTP system. For example, the AdventureWorksDW customer dimension is derived from the AdventureWorks tables Sales.Individual and Person.Contact, plus fields parsed from the XML column describing demographics, among others.

Snowflake schema

Occasionally, it makes sense to limit denormalization by making one dimension table refer to another, thus changing the star schema into a snowflake schema.
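To make these structures concrete, the following T-SQL is a minimal, hypothetical sketch of a star schema with one snowflaked dimension. The table and column names are invented for illustration (they roughly mirror the loading examples later in this chapter) rather than taken from any shipped sample:

-- Dimension tables: a local surrogate primary key plus descriptive attributes
CREATE TABLE dbo.dimCustomer (
    CustomerID   INT IDENTITY(1,1) PRIMARY KEY,   -- surrogate key
    CustomerCode NVARCHAR(20)  NOT NULL,          -- natural key from the OLTP system
    CustomerName NVARCHAR(100) NOT NULL,
    EmailAddress NVARCHAR(100) NOT NULL,
    ZipCode      NVARCHAR(10)  NOT NULL
);

-- A snowflaked dimension: product refers to subcategory, which refers to category
CREATE TABLE dbo.dimProductCategory (
    ProductCategoryID INT IDENTITY(1,1) PRIMARY KEY,
    CategoryName      NVARCHAR(50) NOT NULL
);
CREATE TABLE dbo.dimProductSubcategory (
    ProductSubcategoryID INT IDENTITY(1,1) PRIMARY KEY,
    ProductCategoryID    INT NOT NULL
        REFERENCES dbo.dimProductCategory(ProductCategoryID),
    SubcategoryName      NVARCHAR(50) NOT NULL
);
CREATE TABLE dbo.dimProduct (
    ProductID            INT IDENTITY(1,1) PRIMARY KEY,
    ProductCode          NVARCHAR(20)  NOT NULL,
    ProductName          NVARCHAR(100) NOT NULL,
    ProductSubcategoryID INT NULL
        REFERENCES dbo.dimProductSubcategory(ProductSubcategoryID)
);

-- Fact table: one key per dimension plus the measures of interest
CREATE TABLE dbo.factOrder (
    OrderDate   DATE  NOT NULL,   -- date dimension keyed on the date value itself
    CustomerID  INT   NOT NULL REFERENCES dbo.dimCustomer(CustomerID),
    ProductID   INT   NOT NULL REFERENCES dbo.dimProduct(ProductID),
    OrderAmount MONEY NOT NULL
);

Note that dimProduct carries only a key into dimProductSubcategory rather than repeating the category text on every product row; that reference is what turns the star into a snowflake.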
For example, Figure 70-2 shows how the product dimension has been snowflaked in AdventureWorks' Internet Sales schema. Product category and subcategory information could have been included directly in the DimProduct table, but instead separate tables describe the categorizations. Snowflakes are useful for complex dimensions for which consistency issues might otherwise arise, such as the assignment of subcategories to categories in Figure 70-2, or for large dimensions where storage size is a concern.

FIGURE 70-2: Snowflake dimension. FactInternetSales relates to DimProduct, which refers to DimProductSubcategory, which in turn refers to DimProductCategory.

Traditionally, snowflake schemas have been discouraged because they add complexity and can slow SQL operations, but recent versions of SQL Server eliminate the majority of these issues. If a dimension can be made more consistent using a snowflake structure, then do so unless: (1) the procedure required to publish data into the snowflake is too complex or slow to be sustainable, or (2) the schema being designed will be used for extensive SQL queries that would be slowed and complicated by the snowflake design.

Surrogate keys

Foreign key relationships are the glue that holds the star schema together, but avoid using OLTP keys to relate fact and dimension tables, even though it is often very convenient to do so. Consider a star schema containing financial data when the accounting package is upgraded, changing all the customer IDs in the process. If the warehouse relationship is based on an identity column, the change in the OLTP data can be absorbed by updating the relatively small amount of data in the customer dimension; if the OLTP key were used, the entire fact table would need to be converted. Instead, create local surrogate keys, usually an identity column on dimension tables, to relate to the fact table. This helps avoid ID conflicts and adds robustness and flexibility in a number of scenarios.

Consistency

Defining the star schema enables fast OLAP queries against the warehouse, but gaining the needed consistency requires following some warehousing rules:

■ When loading data into the warehouse, null and invalid values should be replaced with their reportable equivalents. This enables the data's semantics to be researched carefully once and then used by a wide audience, leading to a consistent interpretation throughout the organization. Often, this involves manually adding rows to dimension tables to allow for all the cases that can arise in the data (e.g., Unknown, Internal, N/A).
■ Rows in the fact table should never be deleted, and no other operations should be performed that will lead to inconsistent query results from one day to the next. Often this means delaying the import of in-progress transactions. If a summary of in-progress transactions is required, keep them in a separate fact table of "provisional" data and ensure that the business user is presented with distinct summaries of data that will change. Nothing erodes confidence in a system like inconsistent results.

A key design consideration is the eventual size of the fact table. Fact tables often grow such that the only practical operation is the insertion of new rows — large-scale delete and update operations become impractical. In fact, database size estimates can ignore the size of dimension tables and use just the size of the fact tables.

Best Practice

Keeping large amounts of history in the warehouse doesn't imply keeping data forever. Plan to archive from the beginning by using partitioned fact tables that complement the partitioning strategy used in Analysis Services. For example, a large fact table might be broken into monthly partitions, maintaining two years of history.

Loading data

Given the architecture of the star/snowflake schema, adding new data begins at the points and moves inward, adding rows to the fact table last in order to satisfy the foreign key constraints. Usually, warehouse tables are loaded using Integration Services, but examples are shown here as SQL inserts for illustration.

Loading dimensions

The approach to loading varies with the nature of the source data. In the fortunate case where the source fact data is related by foreign key to the table containing dimension data, only the dimension data needs to be scanned. This example uses the natural primary key in the source data, product code, to identify rows in the Products staging table that have not yet been added to the dimension table:

INSERT INTO Warehouse.dbo.dimProduct (ProductCode, ProductName)
SELECT stage.Code, stage.Name
FROM Staging.dbo.Products stage
LEFT OUTER JOIN Warehouse.dbo.dimProduct dim
    ON stage.Code = dim.ProductCode
WHERE dim.ProductCode IS NULL;

Often, the source dimension data will not be related by foreign key to the source fact data, so loading the dimension table requires a full scan of the fact data in order to ensure consistency. This next example scans the fact data, picks up a corresponding description from a mapping table when available, and uses "Unknown" as the description when none is found:

INSERT INTO Warehouse.dbo.dimOrderStatus (OrderStatusID, OrderStatusDesc)
SELECT DISTINCT o.status, ISNULL(mos.Description, 'Unknown')
FROM Staging.dbo.Orders o
LEFT OUTER JOIN Warehouse.dbo.dimOrderStatus os
    ON o.status = os.OrderStatusID
LEFT OUTER JOIN Staging.dbo.map_order_status mos
    ON o.status = mos.Number
WHERE os.OrderStatusID IS NULL;

Finally, a source table may contain both fact and dimension data, which opens the door to inconsistent relationships between dimension attributes. The following example adds new codes that appear in the source data, but guards against multiple product name spellings by choosing one with an aggregate function.
Without using MAX here, the query may return multiple rows for the same product code:

INSERT INTO Warehouse.dbo.dimProduct (ProductCode, ProductName)
SELECT stage.Code, MAX(stage.Name)
FROM Staging.dbo.Orders stage
LEFT OUTER JOIN Warehouse.dbo.dimProduct dim
    ON stage.Code = dim.ProductCode
WHERE dim.ProductCode IS NULL
GROUP BY stage.Code;

Loading fact tables

Once all the dimensions have been populated, the fact table can be loaded. Dimension primary keys generally take one of two forms: the key is either a natural key based on dimension data (e.g., ProductCode) or a surrogate key without any relationship to the data (e.g., an identity column). Surrogate keys are more general and adapt well to data from multiple sources, but each surrogate key requires a join while loading. For example, suppose a simple fact table is related to dimTime, dimCustomer, and dimProduct. If dimCustomer and dimProduct use surrogate keys, the load might look like the following:

INSERT INTO Warehouse.dbo.factOrder (OrderDate, CustomerID, ProductID, OrderAmount)
SELECT o.Date, c.CustomerID, p.ProductID, ISNULL(o.Amount, 0)
FROM Staging.dbo.Orders o
INNER JOIN Warehouse.dbo.dimCustomer c
    ON o.CustCode = c.CustomerCode
INNER JOIN Warehouse.dbo.dimProduct p
    ON o.Code = p.ProductCode;

Because dimTime is related to the fact table on the date value itself, no join is required to determine that dimension relationship. Measures should be converted into reportable form, eliminating nulls whenever possible. In this case, a null amount, should it ever occur, is best converted to 0.

Best Practice

The extract-transform-load (ETL) process consists of a large number of relatively simple steps that evolve over time as source data changes. Centralize ETL logic in a single location as much as possible, document non-obvious aspects, and place it under source control. When some aspect of the process requires maintenance, this simplifies rediscovering all the components and their revision history. Integration Services and SourceSafe are excellent tools in this regard.

Changing data in dimensions

Proper handling of changes to dimension data can be a complex topic, but it boils down to how the organization would like to track history. If an employee changes her last name, is it important to know both the current and previous values? How about address history for a customer? Or changes in credit rating? Following are the four common scenarios for tracking history in dimension tables:

■ Slowly Changing Dimension Type 1: History is not tracked, so any change to dimension data applies across all time. For example, when a customer's credit rating changes from excellent to poor, there will be no way to know when the change occurred or that the rating was ever anything but poor. Such tracking makes it difficult to explain why the customer's purchase order was accepted last quarter without prepayment. Conversely, this simple approach will suffice for many dimensions. When implementing an Analysis Services database on OLTP data instead of a data warehouse, this is usually the only option available, as the OLTP database rarely tracks history.

■ Slowly Changing Dimension Type 2: Every change in the source data is tracked as history by multiple rows in the dimension table. For example, the first time a customer appears in OLTP data, a row is entered into the dimension table for that customer and corresponding fact rows are related to that dimension row.
Later, when that customer's information changes in the OLTP data, the existing row for that customer is expired, and a new row is entered into the dimension table with the new attribute data. Future fact rows are then associated with this new dimension table row. Because multiple surrogate keys are created for the same customer, aggregations and distinct counts must use an alternate key. Generally, this alternate key will be the same one used to match rows when loading the dimension (see "Loading dimensions" earlier in this chapter).

■ Slowly Changing Dimension Type 3: Combines the Type 1 and Type 2 concepts, whereby history on some but not all changes is tracked based on business rules. Perhaps employee transfers within a division are treated as Type 1 changes (just updated), while transfers between divisions are treated as Type 2 (a new dimension row is inserted).

■ Rapidly Changing Dimension: Occasionally an attribute (or a few attributes) in a dimension will change rapidly enough to cause a Type 2 approach to generate too many records in the dimension table. Such attributes are often related to status or rankings. This approach resolves the combinatorial explosion by breaking the rapidly changing attributes out into a separate dimension tied directly to the fact table. Thus, instead of tracking changes as separate rows in the dimension table, the fact table contains the current ranking or status for each fact row.

Accommodating possible changes to dimensions complicates the "loading dimensions" examples from the previous section. For example, a Type 1 dimension requires that each dimension value encountered first be checked to see whether it exists. If it does not, then it can be inserted as described. If it does exist, then a check is performed to determine whether attributes have changed (e.g., whether the employee name changed), and an update is performed if required. This type of conditional logic is another reason to use Integration Services to load data, as it simplifies the task and performs most operations quickly in memory.

A common practice for detecting changes is to create and store a checksum of the current data values in the dimension row, and then calculate the same checksum over the incoming values for each row read, comparing the two checksums to determine whether a change has occurred. This practice minimizes the amount of data read from the database when performing comparisons, which can be substantial when a large number of attributes are involved. While there is a small chance that different rows will return the same checksum, the risk/reward of this approach is frequently judged acceptable.

Type 2 dimensions extend the idea of Type 1's insert-or-update regimen. If a dimension row does not exist, it is still inserted; but if it exists, two things need to happen. First, the existing row of data is expired — it continues to be used by all previously loaded fact data, but new fact data will be associated with the new dimension row. Second, the new dimension row is inserted and marked as active. There are a number of ways to accomplish this expire/insert behavior, but a common approach adds three columns to the dimension table: effective start and end dates, plus an active flag. When a row is initially inserted, the effective start is the date when the row is first seen, the end date is set far in the future, and the active flag is on.
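The following hypothetical T-SQL sketches this Type 2 pattern, including the checksum-based change detection described above. It assumes the dimCustomer sketch from the star-schema example has been extended with RowChecksum, EffectiveStartDate, EffectiveEndDate, and IsActive columns, and that a Staging.dbo.Customers table holds the incoming rows; all of these names are invented for illustration, not a definitive implementation.

-- Step 1: expire any active row whose incoming attribute values have changed.
UPDATE dim
SET dim.EffectiveEndDate = GETDATE(),
    dim.IsActive = 0
FROM Warehouse.dbo.dimCustomer dim
INNER JOIN Staging.dbo.Customers stage
    ON stage.CustCode = dim.CustomerCode
WHERE dim.IsActive = 1
  AND dim.RowChecksum <> CHECKSUM(stage.Name, stage.EmailAddress, stage.ZipCode);

-- Step 2: insert a new active row for customers that are new or were just expired.
INSERT INTO Warehouse.dbo.dimCustomer
    (CustomerCode, CustomerName, EmailAddress, ZipCode,
     RowChecksum, EffectiveStartDate, EffectiveEndDate, IsActive)
SELECT stage.CustCode, stage.Name, stage.EmailAddress, stage.ZipCode,
       CHECKSUM(stage.Name, stage.EmailAddress, stage.ZipCode),
       GETDATE(), '9999-12-31', 1
FROM Staging.dbo.Customers stage
LEFT OUTER JOIN Warehouse.dbo.dimCustomer dim
    ON stage.CustCode = dim.CustomerCode
   AND dim.IsActive = 1
WHERE dim.CustomerCode IS NULL;   -- no active row exists: new customer or just expired

SQL Server 2008's MERGE statement or Integration Services' Slowly Changing Dimension transformation can also implement this same expire/insert behavior.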
Later, when a new version of the row appears, the effective end date is set and the active flag is cleared. It is certainly possible to use the active flag or the date range alone, but including both provides easy identification of the current rows as well as change history for debugging.

Summary

All organizations use BI whether they realize it or not, because every organization needs to measure what is happening in its business. The only alternative is to make every decision based on intuition rather than data. The quest for BI solutions usually begins with simple queries run against OLTP data and evolves into an inconsistent jumble of numbers. The concepts of BI and data warehousing help organize that chaos. Storing data in a separate warehouse database avoids the contention, security, history, and consistency pitfalls associated with directly using OLTP data. The discipline of organizing dimensions and facts into a star schema using report-ready data delivers both the quality and the performance needed to effectively manage your data. The concepts introduced in this chapter should help prepare and motivate you to read the Analysis Services and Integration Services chapters in this section.

Building Multidimensional Cubes with Analysis Services

IN THIS CHAPTER

Concepts and terminology of Analysis Services
How to build up the components of a multidimensional database
Detailed exploration of dimensions and cubes
Dimension checklist
Configuring database storage and partitions
Designing aggregations to increase query speed
Configuring data integrity options

A naïve view of Analysis Services would be that it is the same data used in relational databases in a slightly different format. Why bother? One could get by with data in relational format only, without needing new technology. One can build a house with only a handsaw as well — it is all about the right tool for the job.

Analysis Services is fast. It serves up summaries of billions of rows in a second — a task that would take relational queries several minutes or longer. And unlike creating summary tables in a relational database, you don't need to create a different data structure for each type of summary. Analysis Services is all about simple access to clean, consistent data. Building a database in Analysis Services eliminates the need for joins and other query-time constructs requiring intimate knowledge of the underlying data structures. The data modeling tools provide methods to handle null and inconsistent data. Complex calculations, even those involving period-over-period comparisons, can be easily constructed and made to look like just another item available for query to the user. Analysis Services also provides simple ways to relate data from disparate systems. The facilities provided in the server, combined with the rich design environment, provide a compelling toolkit for data analysis and reporting.

Analysis Services Quick Start

One quick way to get started with both data warehousing and Analysis Services is to let the Business Intelligence Development Studio build the Analysis Services database and associated warehouse tables for you, based on templates shipped with SQL Server. Begin by identifying or creating a SQL Server warehouse (relational) database. Then open Business Intelligence Development Studio and create a new Analysis Services project.
Right-click on the Cubes node in the Solution Explorer and choose New Cube to begin the Cube Wizard. On the Select Creation Method page of the wizard, choose "Generate tables in the data source" and select the template from the list that corresponds to your edition of SQL Server. Work through the rest of the wizard, choosing measures and dimensions that make sense for your business. Be sure to pause at the Define Time Periods page long enough to define an appropriate time range and periods to make the time dimension interesting for your application.

At the Completing the Wizard page, select the Generate Schema Now option to automatically start the Schema Generation Wizard. Work through the remaining wizard pages, specifying the warehouse location and accepting the defaults otherwise. At the end of the Schema Generation Wizard, all the Analysis Services and relational objects are created. Even if the generated system does not exactly meet a current need, it provides an interesting example. The resulting design can be modified and the schema regenerated at any time by right-clicking the project within the Solution Explorer and choosing Generate Relational Schema.

Analysis Services Architecture

Analysis Services builds on the concepts of the data warehouse to present data in a multidimensional format instead of the two-dimensional paradigm of the relational database. How is Analysis Services multidimensional? When selecting a set of relational data, the query identifies a value via row and column coordinates, whereas the multidimensional store relies on selecting one or more items from each dimension to identify the value to be returned. Likewise, a result set returned from a relational database is a series of rows and columns, whereas a result set returned by the multidimensional database can be organized along many axes, depending on what the query specifies.

Background on Business Intelligence and data warehousing is presented in Chapter 70, "BI Design." Readers unfamiliar with these areas will find that background helpful for understanding Analysis Services.

Instead of the two-dimensional table, Analysis Services uses the multidimensional cube to hold data in the database. The cube thus presents an entity that can be queried via multidimensional expressions (MDX), the Analysis Services equivalent of SQL. Analysis Services also provides a convenient facility for defining calculations in MDX, which in turn provides another level of consistency to the Business Intelligence information stream. See Chapter 72, "Programming MDX Queries," for details on creating queries and calculations in MDX.

Analysis Services uses a combination of caching and pre-calculation strategies to deliver query performance that is dramatically better than queries against a data warehouse. For example, an existing query to summarize the last six months of transaction history over some 130 million rows per month takes a few seconds in Analysis Services, whereas the equivalent data warehouse query requires slightly more than seven minutes.

Unified Dimensional Model

The Unified Dimensional Model (UDM) defines the structure of the multidimensional database, including the attributes presented to the client for query and how data is related, stored, partitioned, calculated, and extracted from the source databases.
At the foundation of the UDM is a data source view that identifies which relational tables provide data to Analysis Services and the relations between those tables. In addition, the data source view supports giving friendly names to included tables and columns. Based on the data source view, measure groups and dimensions are defined according to data warehouse facts and dimensions. Cubes then define the relations between dimensions and measure groups, forming the basis for multidimensional queries.

Server

The UDM, or database definition, is hosted by the Analysis Services server, as shown in Figure 71-1.

FIGURE 71-1: Analysis Services server. Clients reach the server via XMLA over TCP/IP; the server hosts the UDM, processing and caching, and MOLAP storage, drawing source data from relational databases or directly from the SSIS pipeline.

Data can be kept in a Multidimensional OLAP (MOLAP) store, which generally results in the fastest query times, but it requires pre-processing of source data. Processing normally takes the form of SQL queries derived from the UDM and sent to the relational database to retrieve underlying data. Alternately, data can be sent directly from the Integration Services pipeline to the MOLAP store. In addition to storing measures at the detail level, Analysis Services can store pre-calculated summary data called aggregations. For example, if aggregations by month and product line are created as part of the processing cycle, queries that require that combination of values do not have to read and summarize the detailed data, but can use the aggregations instead.
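To give a relational intuition for what an aggregation holds, the following hypothetical T-SQL builds the kind of month-by-product-line summary just described, using the AdventureWorksDW Internet Sales tables (column names may vary slightly by sample version). Analysis Services maintains such summaries automatically inside the MOLAP store, so this is only an analogy, not how aggregations are actually implemented:

-- Hypothetical relational analogy of an aggregation: a pre-computed summary
-- by month and product line, materialized once during processing.
SELECT d.CalendarYear,
       d.MonthNumberOfYear,
       p.ProductLine,
       SUM(f.SalesAmount)   AS SalesAmount,
       SUM(f.OrderQuantity) AS OrderQuantity
INTO   dbo.aggSalesByMonthProductLine
FROM   dbo.FactInternetSales f
JOIN   dbo.DimDate    d ON f.OrderDateKey = d.DateKey
JOIN   dbo.DimProduct p ON f.ProductKey   = p.ProductKey
GROUP BY d.CalendarYear, d.MonthNumberOfYear, p.ProductLine;

A query that needs sales by month and product line can then read this small summary instead of scanning the detail rows; Analysis Services applies the same idea transparently, selecting an appropriate aggregation at query time.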