398 Hands-On Microsoft SQL Server 2008 Integration Services specify in the Cache Size fields. Two Cache Size fields have been provided in the Partial caching section—one is for a 32-bit system and the other one applies to a 64-bit system. What happens with partial cache mode is that the partial reference data rows are loaded into the cache based on the cache size value and as the lookup operation applies, the input data rows are matched against the reference data in the cache. If a match is found, the row is directed to the match output, and if a match is not found in the cache, a query is sent to the OLE DB source to find if a match exists. Again, if a match is found the row is sent to the match output; however, if a match is not found in the reference database the non-matching row is directed to the error output or no-match output, depending on how you’ve chosen to handle the no-match rows. is trip to an external reference database to find a match or no match could be quite expensive and can hurt the performance of a lookup operation. If your reference data is not changing while the package is running and you’ve sufficient free memory available on the server, you can choose to store the no-matching rows in the cache using the option “Enable cache for rows with no matching entries.” You can also apply a limit on the amount of cache no matching rows can use by specifying a percentage value in the Allocation From Cache field. e next time the no-matching row with the same key appears in the input data set, the lookup transformation already knows about this, as it has it in the cache and doesn’t do a trip out to the database. is process goes on till the cache is completely filled up, at which time the Lookup transformation removes the least frequently used rows from the cache. You can also improve the query performance by selecting the Modify The SQL Statement check box. Using this option, you can use an SQL statement to select a subset of the reference data set. For example, if you have selected a table or a view as the reference data set in the Connection page that contains address details for the whole country and your input records contain records for a state only, you don’t need to load all the reference data set in the memory; rather, you need to load only the particular state’s address details. To limit the reference data set dynamically, you can also use values or parameters in the WHERE clause of the SQL statement and use the Parameters button to map parameters to input columns. When you use parameters in the WHERE clause mapped to input columns, these parameters are first evaluated from the input data set at run time and limit the data set that needs to be referenced and hence increases overall performance. No cache c is option does not load reference data set into the cache; rather, it makes a connection to the external data source using an OLE DB Connection Manager to perform the lookup operation. is option works like partial cache mode with a difference that all the input rows are matched against data in the database table, other than the last retrieved row, which always resides in the cache. Chapter 10: Data Flow Transformations 399 When you choose this option, the Partial Caching section in the Advanced option is disabled as there is no cache to configure. However, the Custom query can be used to modify the SQL statement that is applied in the Connection page. You can specify the connection settings for the reference data set in the Connection page. The graphical interface in this page changes to match the options you have selected in the General page. When you select full cache mode, you can choose a Connection type of either a Cache Connection Manager or an OLE DB Connection Manager. If you choose the Cache Connection Manager option, the Connection page has only one field in which you can specify the Cache Connection Manager. However, if you choose an OLE DB Connection Manager in the case of full cache mode or are using a partial cache mode or no cache mode, the interface of the Connection page changes to allow you to specify settings for an OLE DB Connection Manager. In this case, the reference data set can be a table or view in a database or can be a result of an SQL statement. So, you specify an OLE DB Connection Manager, select a table or view or an SQL query if you prefer to use an SQL query in the Connection page. In this page, you can either select an existing table or create a new table by clicking the New button provided for the “Use a table or a view” option and can populate this table at run time before this Lookup transformation executes. If you select “Use results of an SQL query,” you can either type your SQL statement directly in the Option field or you can build the query with the Query Builder, which you can invoke by clicking Build Query. You can also read the query from a file by clicking Browse. You specify the join columns in the Columns tab and can also select reference columns that you want to include in the data flow. To make a join, drag a column from Available Input Columns and drop it on the column you want to join with in the Available Lookup Columns list. You won’t be able to map or join two columns that don’t have matching data types. To select a column to include in the output, check the box provided for the Available Lookup Columns. As you select a column, it will be listed in the lower half of the editor window, where you can select the lookup operation and assign an output alias as well to the selected column. You can add the selected column as a new column or replace the values of the same column in the input columns. Finally, note a few cautions when you add this transformation in the data flow. The Lookup transformation’s lookup operation is case-sensitive—i.e., the string comparison in the lookup is case sensitive—while this may not be the case with your database if you’re using a case-insensitive collation. So you need to be careful with the input data set case, vis-à-vis the case of the reference data set, and you also need to understand where the comparison is made. If you’re using full cache mode, the comparison is made within the lookup transformation and is case sensitive; however, if you are using no-cache mode, then the lookup operation is made in the database engine and, based 400 Hands-On Microsoft SQL Server 2008 Integration Services on the database collation, could be case-insensitive. Similarly, you need to be clear how you are configuring your cache in the partial cache mode and where the comparison will happen. One solution may be to convert both the string data sets to either lowercase or uppercase for this transformation. Another reason for lookup failures could be the null values in data. If the lookup operation matches columns that contain null values, it will fail when you are using no or partial caching and the comparison happens in the database engine instead of in the lookup transformation. A possible solution for this could be either to use full caching mode or to remove null values by modifying the SQL statement. Last, you can join (map) columns in the lookup transformation with any data type other than DT_R4, DT_R8, DT_TEXT, DT_NTEXT, or DT_IMAGE because they are not supported data types for joins. Hands-On: Updating PersonContact Data You receive contact details for persons interested in your products from different sources. As a policy, you standardize data as you insert it into a PersonContact table by performing lookups for address details on the basis of postal code, which is a mandatory requirement for the data to be accepted. This data mart will be used for marketing purposes, and hence you want to make sure that all the contacts have a Salutation and there are no duplicated contacts in the table. Method In this exercise, you receive data in two different file formats, the Microsoft Excel file format and the flat file format, and you will load this data into a PersonContact table in the Campaign database after applying the required transformations. In the data flow of this package, you’ll combine the data from these two files, format it to the required data types, generate a salutation in the data flow, perform a lookup on the postal code to get the correct city, and delete duplicate records before loading it into the table. Here is the step-by-step process: Add data flow sources to get the data in the data flow. C Convert the data to match the data types outputted by both the data flow sources C and then combine the data from both data streams. Derive a salutation. C Enhance address details. C Chapter 10: Data Flow Transformations 401 Delete duplicates and insert data into the PersonContact table. c Execute the package to see the results. c Exercise (Add Data Flow Sources) You will create a new Integration Services project that will subsequently be used for all the packages developed in this chapter. You will add a data flow in the package and then will add data flow sources in the data flow. 1. Create a new Integration Services project with the following details: Name Data Flow transformations Location C:\SSIS\Projects 2. When the blank project is created, rename the package as Updating PersonContact.dtsx. 3. Drop the Data Flow task from the Toolbox onto the Control Flow Designer surface and then double-click it to go to the Data Flow tab. 4. In the Data Flow tab, drag and drop a Flat File source and an Excel source from the Toolbox onto the Designer surface. 5. Double-click the Flat File source to open the Flat File Source Editor. Click the New button opposite the Flat File Connection Manager field to create a connection manager. 6. In the Flat File Connection Manager Editor, type PersonDetails01 in the Connection Manager Name field. Specify C:\SSIS\RawFiles\PersonDetails01.txt in the File Name field. Select the check box for the “Column names in the first data row” option. The rest of the options will be automatically filled for you with default, as shown in Figure 10-1. 7. Go to the Columns page and the columns will be listed for you as the Flat File Connection Manager selects {CR}{LF} as the row delimiter and comma {,} as column delimiter values automatically. 8. Go to the Advanced page where you can specify the column data type details. In this exercise, our final destination is the PersonContact table, so you can check the schema of this table and update column data types accordingly. The following table shows these details, which you need to apply to the column properties in this page. For example, to specify settings for the Postcode column, click this column and then in the right side pane, go to the OutputColumnWidth field and change it to 12. Make sure DataType is set to String [DT_STR]. Similarly apply 402 Hands-On Microsoft SQL Server 2008 Integration Services the following settings to all the columns. After having applied all the settings, click OK to close the page and return to the Flat File Source Editor. Name OutputColumnWidth DataType FirstName 50 string [DT_STR] LastName 50 string [DT_STR] Gender 1 string [DT_STR] Married 1 string [DT_STR] AddressLine1 255 string [DT_STR] AddressLine2 255 string [DT_STR] Postcode 12 string [DT_STR] Figure 10-1 Flat File Connection Manager settings Chapter 10: Data Flow Transformations 403 9. Go to the Columns page and check out that all the Available External Columns have been selected. Click OK to close the editor. Right-click the Flat File source and rename it PersonDetails01. 10. Right-click the Excel source and rename it PersonDetails02. Double-click PersonDetails02 to open the Excel Source Editor. Click the New button provided opposite to the OLE DB Connection Manager field. 11. In the Excel Connection Manager dialog box, specify C:\SSIS\RawFiles\ PersonDetails02.xls in the Excel File Path field. Leave Microsoft Excel 97-2003 specified in the Excel Version field, and leave the check box selected for “First row has column names.” Click OK to create this connection manager. 12. In the Excel Source Editor, leave “Table or view selected” in the Data Access Mode field. Select PersonDetails02$ in the “Name of the Excel sheet” field from the drop-down list and go to the Columns page to see that all the Available External Columns have been selected. Click OK to close this editor. 13. Rename the Excel Connection Manager to PersonDetails02. Exercise (Combine Two Data Streams) In this part of the exercise, you will add a Data Conversion transformation to convert the data type of columns coming from the Excel source to match with the data types of columns coming through Flat File source and then use Union All transformation to combine these two data streams. 14. Right-click the Excel source and choose Show Advanced Editor from the context menu. Go to the Input and Output Properties tab, expand the Excel Source Output and check out the properties of columns listed in External Columns and Output Columns. Note that these columns are of the Unicode string [DT_ WSTR] data type with column width (Length) equals 255. These columns need to be converted to the data type so that you can combine them with the columns from the Flat File source. Close the Advanced Editor. 15. Drop a Data Conversion transformation from the Toolbox onto the Data Flow surface just below the Excel source. Join the Excel source with this transformation using the green arrow. 16. Double-click the Data Conversion transformation to open its editor. As you want to convert the data type of all the columns, select the check boxes provided in front of all the columns in the Available Input Columns. As you select the columns, a row for each column will be added in the lower grid section, where you can configure the changes. 17. For the FirstName Input Column, note its Output Alias, which means the converted column will be added to the transformation output as Copy of FirstName. Click in the Data Type field and change it to string [DT_STR] from the drop-down list 404 Hands-On Microsoft SQL Server 2008 Integration Services and change the Length to 50. A Code Page will be selected by default based on the code page of the computer; however, you can change it if you need to by selecting a different code page from the drop-down list. Similarly, change settings for all the columns as shown in the Figure 10-2. 18. Click OK to close the transformation. Rename the Data Conversion transformation Converting PersonDetails02. 19. Drop the Union All transformation from the Toolbox on the Data Flow surface between PersonDetails01 and Converting PersonDetails02. Join both of these components to the Union All transformations using green arrows. Figure 10-2 Converting Excel source data using Data Conversion transformation Chapter 10: Data Flow Transformations 405 20. Double-click the Union All transformation to open the editor. Here, you will see Output Column Name columns automatically mapped to Union All Input 1 columns, whereas you will need to map Union All Input 2 columns yourself. Click in the FirstName column under Union All Input 2 and select the converted Copy of FirstName column. If you select FirstName in this column, you will get an exception error for incompatible data type. Map all the output columns to the converted columns as you did for FirstName column (Figure 10-3). 21. Click OK to close the editor. Rename the Union All transformation Merging PersonDetails01 and PersonDetails02. Figure 10-3 Combining two data streams using the Union All transformation 406 Hands-On Microsoft SQL Server 2008 Integration Services Exercise (Derive Salutation) You will use a Derived Column transformation in this part to derive a salutation. You need to derive a salutation as Mr., Mrs., or Miss. 22. Drop a Derived Column transformation from the Toolbox below the Merging PersonDetails01 and PersonDetails02 and join both of them by dragging the green arrow onto the Derived Column transformation. 23. Double-click the Derived Column transformation to open the editor. This transformation has an Expression Builder interface. Click in the Derived Column Name field and type Salutation. By default, <add as new column> will be displayed in the Derived Column field, which you can change to replace an existing column from the drop-down list. 24. Type the following expression in the Expression field (refer to Figure 10-4): Gender == "m" ? "Mr." : (Gender == "f" && Married == "y" ? "Mrs." : "Miss") Figure 10-4 Deriving a salutation using Derived Column transformation Chapter 10: Data Flow Transformations 407 In this expression, you are using a conditional operator that returns one of the two expressions based on the evaluation of the Boolean expression. 25. Select string [DT_STR] from the drop-down list in the Data Type field and specify 4 in the Length column if not automatically specified. If you can’t change the data type here, you can in the Input and Output Properties tab of the Advanced Editor. Click OK to close the editor. Rename this transformation Deriving Salutation. Exercise (Enrich Address Details) One of the requirements you will address in this exercise is to standardize the address detail by including City on the basis of the Postcode available in the data. You’ve a table called PostalCity in the Campaign database that contains a list of cities against postal codes, and you will perform a lookup operation to get the correct city against a postal code. As a lookup operation is a case-sensitive operation in SSIS, you will convert the Postcode column in both places—i.e., in the pipeline as well as in the reference data to UPPERCASE. Let’s first use a Character Map transformation to convert the pipeline data. 26. Drop a Character Map transformation from the Toolbox on the Data Flow Designer. Join the Character Map transformation with Deriving Salutation by dragging and dropping the green arrow from the latter. 27. Double-click the Character Map transformation to open the editor. Click in the check box next to the Postcode column to select it. A row will be added in the lower grid area with Input Column selected as Postcode. 28. Check that the Destination column in the lower grid has only two possible values: New Column and In-place Change. Select the In-Place Change value in the Destination column. Note that the Output Alias is also changed to Postcode (see Figure 10-5). 29. Invoke the drop-down list of available operations in the Operation field. Select Uppercase by clicking the check box in front of it. Click OK to complete your selection. 30. Click OK to close the editor. Rename Character Map transformation Uppercasing Postcode. 31. Now drop a Lookup transformation on the Data Flow surface after the Uppercasing Postcode transformation. Join both of these transformations using a green data flow path. 32. Double-click the Lookup transformation to open the editor. Select the Partial Cache mode in the General page. 33. In our data stream, some of the postal codes will not find matches for cities, so they will fail the Lookup transformation. You will configure this transformation to divert these mismatching records to a flat file so that you can review them later on. Select the “Redirect rows to no match output” option under the “Specify how to handle rows with no matching entries” field. . 398 Hands-On Microsoft SQL Server 2008 Integration Services specify in the Cache Size fields. Two Cache Size fields have been provided in the Partial caching section—one. 400 Hands-On Microsoft SQL Server 2008 Integration Services on the database collation, could be case-insensitive. Similarly, you need to be clear how you are configuring your cache in the partial. Union All transformation 406 Hands-On Microsoft SQL Server 2008 Integration Services Exercise (Derive Salutation) You will use a Derived Column transformation in this part to derive a salutation.