438 Hands-On Microsoft SQL Server 2008 Integration Services in auditing. This transformation uses the Advanced Editor to expose its properties. To store the count of rows into a variable, you need to define a variable before you can use this task. The variable must be in the scope of the Data Flow task to which the transformation belongs. To count the rows in the data flow and update the variable value, you simply select the variable in the VariableName field of the Component Properties tab. You can actually have multiple RowCount transformations writing to the same variable, for instance, if you are updating auditing information about loading of data rows you might choose to use same variable in multiple RowCount transformations. However, you need be aware of the fact that the multiple RowCount transformations using same the variable could update variable value multiple times, and hence your auditing could get wrong values, as the variable will keep the last value that it is updated with. Figure 10-17 Configuring Audit transformation Chapter 10: Data Flow Transformations 439 Business Intelligence Transformations The Business Intelligence transformations enable you to maintain a Slowly Changing Dimension (SCD); perform data cleaning, data standardization, and text mining operations; and run data mining prediction queries against data mining models. The Slowly Changing Dimension (SCD) transformation is slightly complex to understand, but it provides a configuration wizard that makes the task of loading a dimension table much easier. You will work through a Hands-On exercise to use an SCD transformation, and you will perform another Hands-On exercise to remove duplicates from data using Lookup, Fuzzy Lookup, and Fuzzy Grouping transformations. Slowly Changing Dimension Transformation SCD helps you manage slowly changing data attributes in a data warehouse. To describe this transformation clearly and for the benefit of those who are new in data warehousing, let’s start with some conceptual details. When a database for a data warehouse is built using the dimensional modeling design, the data is sliced across many dimensions, such as time, location, and product, and is stored in two types of tables called the fact table and the dimension table. A dimension table stores records relevant for that particular dimension, and a fact table contains measures for the dimensions. For example, a product dimension table contains information about products such as product category, model details, and product features. A sales fact table contains measures of sales across different dimensions, such as total sales of a product, total sales in a month, and total sales at a location. The advantage of dimensional data modeling is that you can aggregate data across multiple dimensions—for example, total sales of a product at a location in a year. To learn more about dimensional modeling, refer to Chapter 12. One of the issues you will face while maintaining a data warehouse is how to handle changes to the dimensional data across time. For example, contacts may change their addresses, products can change over time, new products may be introduced, some old products may be removed, or your company may decide to reorganize its sales regions. These changes may pose some challenges when you’re tasked with maintaining the history of these changes. In order to create complex business reports that contain aggregations over a period of time, maintaining history in a data warehouse or online analytical processing (OLAP) system is much more important compared to online transaction processing (OLTP) systems, which are generally designed to represent the current state. For example, if a sales representative is allocated a new sales region, the commission for the sales she has made in the previous region should still go to her; this makes it necessary to keep a history of changes to her sales region. To maintain a history for changing data while dealing with restrictions imposed by data complexity and storage limitations, information engineers have been adopting 440 Hands-On Microsoft SQL Server 2008 Integration Services different approaches to different types of data. When maintaining a history for a particular data type is not important, an existing record simply gets overwritten by a new record. This is the simplest form of handling a slowly changing dimension and is classified as Type 1. When preserving the history of changing attributes of data is critical to business, the new record is added to the dimension along with the existing original record. This is classified as a Type 2 slowly changing dimension. This is a more commonly used form that allows you to maintain accurate historical data; however, if the dimension sizes become too big and the number of rows is too high, you may have to work on storage and performance issues. In a Type 3 slowly changing dimension, the issue of additional rows with each change is addressed by adding new columns for the current value and when the current value became effective. Hence, the dimension contains the original value and the current value, irrespective of how many changes have occurred in between, thus limiting the ability to register changes accurately; this does overcome the limitations of Type 2, however. In Integration Services, the slowly changing dimension transformation doesn’t support Type 3 changes. The SCD transformation supports up to six different outputs to output records for different types of processing before loading them into the dimension table. These outputs have been assigned different names on the basis of different Change Type records they carry. The outputs are discussed in more detail later in this chapter, but here are brief descriptions for you: Changing Attributes Updates Output c Used for the rows where attributes of dimension members are changing. Fixed Attribute Output c Used for the rows where the attributes of dimension members are not allowed to change—i.e., the attributes are fixed. Historical Attributes Inserts Output c Used for the rows where the attributes of dimension members are required to keep history. Inferred Member Updates Output c Used for the rows where the attributes of the dimension members are to be updated as the row was earlier inserted as an inferred member with most of the attributes as blank. New Output c Used for the new member rows and for the current member rows that are to be created as new where the existing member has been updated as expired to keep history. Unchanged Output c Used for the existing members in the dimension table where no attribute has changed and no work is performed by the SCD; such rows are diverted to unchanged output and no data flow is created by default; hence, this acts as a sink for unused rows. However, if you want to capture these rows, you can collect them by creating a data flow path connected to this output. Chapter 10: Data Flow Transformations 441 You can use the Slowly Changing Dimension transformation to update and insert records in data warehouse dimension tables for different types of changes. This transformation contains a wizard that helps you configure various branches relevant to the change types. You set various change types on the columns and select attribute options in a wizard to configure the SCD transformation. When you finish configurations in the end, the wizard configures the SCD transformation to send the different types of records based on their change type settings to different outputs. But the wizard does not stop at simply outputting the records; it also creates downstream data flow to each used output by adding the required components to load data to the SCD table. The great thing about this is that you have complete flexibility to modify the data flow created by the SCD Wizard. When you connect an input to this transformation in the data flow and double-click it, you invoke the Slowly Changing Dimension Wizard. The SCD Wizard allows you to select a dimension table and specify business keys to map transformation input columns to the dimension table columns. Then you can manage changes in the various columns by specifying a change type for each dimension column. The following change types are available while using the SCD Wizard: Fixed Attribute c e Fixed attribute change type for a column indicates that the value in the column is not expected to change and any changes on that column should be treated as errors. At run time, the SCD transformation performs a lookup for the incoming row against the dimension table to check for a match. If SCD doesn’t find any match for a particular row, it diverts that row to the New Output; however, if a match is found and the matching key contains the columns with Fixed attribute change types, the SCD transformation diverts the row to the Fixed Attribute Output. As no change is expected in case of Fixed attribute change type columns, such changes are not expected to be applied to the dimension table, hence the SCD Wizard creates no data flow for this output. However, if you want to capture these rows, you can create a data flow and connect to the Fixed Attribute Output of the SCD transformation. You can also specify to fail the transformation if changes are detected in a Fixed attribute in the wizard. Changing Attribute c e columns that you select for the changing attribute change type are treated as Type 1 changes, and the changed rows will overwrite the existing rows. At run time, the SCD transformation performs a lookup for the incoming row against the dimension table and checks for the match. If a match is found and the matching key has the specified columns with changed values, the transformation directs the row to the Changing Attributes Updates Output. e SCD Wizard also adds an OLE DB Command transformation connected to this output to perform the UPDATE operation on the dimension table. If the SCD 442 Hands-On Microsoft SQL Server 2008 Integration Services transformation finds a matching row during the lookup operation but the row doesn’t contain any change, such rows are diverted to the Unchanged Output and the SCD Wizard doesn’t create any data flow for this output by default. However, if you want to capture such rows, you can create a downstream data flow and attach it to this output. Your data warehouse may contain multiple records for a business key, as it keeps the history depending on what kind of changes it has been through. So when an SCD transformation gets a Changing attribute type record, it may match against multiple records. For this, you can select an option in the wizard to indicate whether the task should change all the matching records, including outdated records, when a change is detected in a Changing attribute. Historical Attribute c e Historical attribute change type on a column creates a new record and marks existing record as expired by changing the date fields or setting a flag in an indicator column. ese changes are treated as Type 2 changes. At run time, when the SCD transformation finds a matching row that has changes in a Historical attribute, it diverts the row to a Historical Attribute Inserts Output. e SCD Wizard creates a data flow, which first updates the existing record and then inserts the incoming record as a new record with the same business key in the dimension table. In the data flow that the SCD Wizard creates, it adds a Derived Column transformation as the first component that is connected to the Historical Attribute Inserts Output. is component adds a new column to the incoming record as an expiry indicator (e.g., IsCurrent = False). e second component added in this data flow branch is an OLE DB Command transformation that updates the existing record in the dimension table with the newly added column to mark it as expired (Set IsCurrent = False where IsCurrent = True). After updating the existing record, it then directs all the columns sans the newly added indicator column of the incoming record to the New Output data flow path using the Union All transformation so that a new record is added to the dimension table. Refer to Figure 10-23, later in the chapter, for clarity on the data flow it creates. If you find this difficult to understand now, complete the Hands- On exercise and it will make much more sense. e New Output also gets all the new records for which SCD did not find any match within the dimension table. e Union All transformation combines new records and the records coming from the OLE DB Command component of the Historical Attribute Inserts Output data flow path for the New Output. After Union All, another Derived Column transformation is added that adds a new column and the value to indicate that the record is current (IsCurrent = True to indicate the current state of the record) before being added to the dimension table. In addition to the outputs specified in these Change Type descriptions, SCD includes one more, called Inferred Member Updates Output. This output gets all the Chapter 10: Data Flow Transformations 443 inferred members’ records. An inferred member record is created in the dimension table with the minimal data, anticipating that more details of the record will arrive when a later loading process runs. An inferred member is created because the fact table contains foreign keys for the dimension tables, and sometimes the loading of the fact table fails because fact data arrives when a key doesn’t yet exist in the dimension table. To avoid the failure of a fact table loading process, you should create a record in the dimension table with minimal data so that when the fact arrives earlier than its dimension member, it can still be loaded. This inferred member record is updated later when the attribute data arrives. Support for uploading inferred members details when they arrive is inbuilt in the SCD transformation, and the SCD Wizard allows you to choose one of the following methods (see Figure 10-22, later in the chapter) to identify an inferred member while uploading: All columns with a change type are null. c Use a Boolean column to indicate whether the current record is an inferred member. c You will use the first option if you have created a minimal record in the dimension table while loading the fact table, when all the columns with a change type have null values. Alternatively, you can use a Boolean column to indicate that the incoming record is an inferred member record. After you specify the inferred member identification and finish the SCD Wizard, you will see that the wizard has created a data flow with lot of data flow components connected to different outputs to load the dimension table. Well, it is time to work through a Hands-On exercise to see how you can load a slowly changing dimension using the SCD Wizard. Later in the exercise you will see that the SCD adds all the downstream components for you that you can change if required. This flexibility takes out the complexity from the development of a package containing SCD and hence makes it quite easy. Hands-On: Loading a Slowly Changing Dimension You are tasked with loading the DimCustomer dimension using a SCD transformation from the Customer table. The challenge is that the customers were registered using a variety of methods, and not all customers were allocated a customer number, so the Customer table contains some records that do not have a Customer ID assigned to them. You need to identify those customers as well. The other business requirements are that changes to phone numbers are allowed and you don’t need to keep a history for these changes; however, a complete history is to be retained for change of address. As an Email attribute is required for logging on to the customer’s portal, this attribute is not allowed to be changed. 444 Hands-On Microsoft SQL Server 2008 Integration Services Method The Campaign database contains a Customer table that you will use to load the DimCustomer dimension table. In real life, DimCustomer would be in a different database, but to keep things simple for this exercise, the DimCustomer table has been created in the Campaign database. The exercise is divided in four parts to work with Slowly Changing Dimension. Exercise (Add Data Flow Task and OLE DB Source to a New Package) In the first part you will create a new package and add starting components to the package. 1. Open the Data Flow transformations project in BIDS and add a New SSIS Package. Rename this new package as Loading Slowly Changing Dimension.dtsx. 2. Drop a Data Flow task from the Toolbox onto the Control Flow surface and rename it Loading SCD. 3. Double-click the Loading SCD to go to the Data Flow panel. Drop an OLE DB source from the Toolbox onto the Data Flow surface. Rename the OLE DB source Customer. 4. Double-click Customer to open the OLE DB Source Editor. Click the New button next to the OLE DB Connection Manager field. Choose localhost. Campaign in the Configure OLE DB Connection Manager dialog box and click OK. Leave Table Or View selected in the Data Access Mode field. Select the [dbo].[Customer] table from the drop-down list in the “Name of the table or the view” field. Go to the Columns page to map the External Columns to the Output Columns. Click OK to close the OLE DB Source Editor. Exercise (Identify Customers that Do Not Have a Customer ID) An SCD transformation needs to have a business key so that it can do a lookup with the dimension table using this key. As a requirement, the SCD transformation doesn’t allow this key to contain null values. In this part, you will filter out all the records coming in the pipeline that have a Customer ID equal to Null. 5. From the Toolbox, drag and drop the Conditional Split transformation below the Customer source. Connect both the components by dragging the green arrow from the Customer and dropping it on the Conditional Split component. Rename the Conditional Split component Customers Filter. 6. Double-click the Customers Filter component to open the Conditional Split Transformation Editor. The interface of this component is basically an Expression Builder UI. Expand the NULL Functions node in the top-right section of the Chapter 10: Data Flow Transformations 445 dialog box and drag and drop ISNULL( <<expression>> ) in the Condition column in the grid. When you click outside the condition field, the expression will turn red, indicating that there is an error with the expression syntax. This will add an output with the name Case 1 to this component. 7. Expand the Columns node in the top-left box and drag the CustomerID on the <<expression>> part of the ISNULL function so that the Condition field contains ISNULL( [CustomerID] ). When you click outside the Condition field, the expression turns black, indicating that the expression syntax is correct. The ISNULL function will be true for the null values of CustomerID column and the component will divert rows that evaluate true to expression to the Case 1 output. 8. Click in the Output Name column and change Case 1 to Customers without CustomerID. Click in the Default Output Name field and rename it Customers with CustomerID, as shown in Figure 10-18. This component has been configured to output records to two outputs on the basis of whether the CustomerID is a null or a non-null value. Click OK to close the editor. 9. Drag an Excel destination onto the Data Flow surface below the Customers Filter. Click the Customers Filter and drag the green arrow onto the Excel Destination. Select Customers without CustomerID in the Output field of the Input Output Selection dialog box. Click OK to return to the Designer. 10. Double-click the Excel destination to open the editor. Click the New button next to OLE DB Connection Manager field. Type C:\SSIS\RawFiles\Customers without CustomerID.xls in the Excel File Path field in the Excel Connection Manager and click OK to return. The specified file will be created at run time. 11. Click the New button next to “Name of the Excel sheet” field and replace Excel Destination written after CREATE TABLE with Customers without CustomerID in the Create Table dialog box; then click OK twice. Select the Customers_without_CustomerID in this field. Go to the Mappings page to create mappings between the input and destination columns. Click OK to close the editor. Exercise (Configure the Slowly Changing Dimension Transformation) Finally, you will configure SCD transformation using the SCD Wizard in this part. 12. Drag the Slowly Changing Dimension transformation from the Toolbox onto the Data Flow surface just below the Customers Filter component. Connect Customers Filter by dragging the green arrow connector to the Slowly Changing Dimension transformation. Rename Slowly Changing Dimension to Loading DimCustomer. 13. Double-click the Loading DimCustomer transformation. This will start the SCD Wizard. Click Next to go to the Select A Dimension Table And Keys screen. 446 Hands-On Microsoft SQL Server 2008 Integration Services 14. Make sure that localhost.Campaign is selected in the Connection Manager field. Select [dbo].[DimCustomer] from the drop-down list in the Table Or View field. As you select the dimension table, you will see that the Input Columns are mapped to the Dimension Columns in the grid. Click in the Key Type column of the CustomerID field and select Business Key from the drop-down list, as shown in Figure 10-19. Click Next to move on. Figure 10-18 Conditionally splitting customers on the basis of the CustomerID column Chapter 10: Data Flow Transformations 447 15. Specify the change types for various columns. As per the requirements set out in the initial statement of this exercise, you are to select all the address fields as the Historical attribute, e-mail as the Fixed attribute, and phone numbers as the Changing attribute. Click in the first row of the Dimension Columns and select Address Line 1 from the drop-down list; then click in the Change Type field to set it as a Historical attribute from the drop-down list. Set the different columns to different change types as per the following table. Dimension Columns Change Type Address Line 1 Historical attribute Address Line 2 Historical attribute Address Line 3 Historical attribute City Historical attribute State Historical attribute Figure 10-19 Selecting Business Key in the SCD Wizard (continued) . engineers have been adopting 440 Hands-On Microsoft SQL Server 2008 Integration Services different approaches to different types of data. When maintaining a history for a particular data type is not. 438 Hands-On Microsoft SQL Server 2008 Integration Services in auditing. This transformation uses the Advanced Editor to expose. to perform the UPDATE operation on the dimension table. If the SCD 442 Hands-On Microsoft SQL Server 2008 Integration Services transformation finds a matching row during the lookup operation