468 Hands-On Microsoft SQL Server 2008 Integration Services Hands-On: Removing Duplicates from Owners Data The Fuzzy Lookup and Fuzzy Grouping transformations are the data flow components that you can use for data cleaning purposes. One of the main issues with data quality is to remove duplication in data, whether it occurs at loading time or it already exists in the data that you want to cleanse. Until now, you’ve worked on a couple of instances to remove duplicates in the earlier Hands-On exercises, but those instances were dealing with exact duplicates. In this exercise, you will deal with fuzzy duplication of data. You will work with these components to remove exact as well as fuzzy duplicates from the input data. The scenario is that you are maintaining an Owner table that contains contact details for the owners of your products. You regularly receive an Owner’s data feed that sometimes contains duplicate data. This duplicate data is not consistent, as users tend to provide their details differently at different occasions. You need to make sure no duplicate record is added to the Owner table. Method You have the Owner table in the Campaign database and receive OwnersFeed.xls files regularly that contain owner records. This Excel file can contain duplicate records for the same person. The complication, however, is that these records may not be exact duplicates, as persons provide their contact details differently at different occasions. Our sample OwnersFeed.xls file contains 13 records, of which 3 are unique records and the other 2 records have five variants each, with different name spellings and address details; one of these 2 records already exists in the Owner table. Open the OwnersFeed .xls file to have a look at the incoming data (see Figure 10-25). Figure 10-25 Incoming data contains variants of duplicate records Chapter 10: Data Flow Transformations 469 The owner with first name Johnathon already exists in the table and has five different variants of contact details. The other owner with first name Kathrine is a new owner, but has five different variants of contact details in the feed. You want to load only four records into the Owner table consisting of three unique records plus an owner record for Kathrine, while all the duplicate variants and the owner records for Johnathon should be removed. You will remove the exact duplicates from the OwnersFeed file using the Sort and Lookup transformation. After removing exact and matched duplicates, you will use the Fuzzy Lookup transformation to remove fuzzy matched duplicates. At this stage only the duplicate rows that exist in variant forms in the pipeline would be left. You will remove these duplicate variants using the Fuzzy Grouping transformation. Exercise (Create Removing Duplicate Owners Package) Start by creating a new package, adding the Data Flow task into this package, and then adding an Excel source to get the data in the data flow. 1. Open the Data Flow transformations project in BIDS. Add a New SSIS Package in the project and then rename it as Removing Duplicates.dtsx. 2. Drop a Data Flow task from the Toolbox onto the Control Flow surface and rename it Removing Duplicates from OwnersFeed. 3. Double-click the Removing Duplicates from OwnersFeed task to go to the Data Flow panel. Drop an Excel source from the Toolbox onto the Data Flow surface. Rename the Excel source OwnersFeed. 4. Double-click OwnersFeed to open the Excel Source Editor. Click the New button next to the OLE DB Connection Manager field. Type C:\SSIS\RawFiles\ OwnersFeed.xls in the Excel File Path field. Leave the “First row has column names” option checked. Click OK to add an Excel Connection Manager and return to the Excel Source Editor. Leave Table Or View selected in the Data Access Mode field. Click in the Name of the Excel sheet field and select Owners$ from the drop-down list. 5. Go to the Columns page and verify that all the columns have been selected. Click OK to close the Excel Source Editor. 6. Take an opportunity to rename the Excel Connection Manager as OwnersFeed Connection. 7. Right-click anywhere on a blank surface in the Connection Managers area and choose New Connection from the context menu. Select Excel from the Connection Manager Type list and click Add. This will open the Excel Connection Manager dialog box. Type C:\SSIS\RawFiles\DuplicateOwners.xls in the Excel File Path field. Leave the First Row Has Column Names check box selected and click OK to add this connection manager. Once it is added, rename this connection manager DuplicateOwners Connection. 470 Hands-On Microsoft SQL Server 2008 Integration Services Exercise (Remove Exact Duplicates) If you look at the data in the Figure 10-25, you will see that Johnathon Skinner and Kathrine Morris have duplicate records in the OwnersFeed. You will use a Sort transformation to remove these duplicates. Then you will check the OwnersFeed against the Owner table in the database to see whether any of the record exists in the Owner table. Johnathon Skinner also exists in the Owner table and will be removed from the input data flow using a Lookup transformation. The other variants of these owners will remain unaffected and will have to be dealt with separately. 8. Drop the Sort transformation onto the Data Flow surface and join OwnersFeed with it using the green path. 9. Double-click the Sort transformation to open the editor. Select all the Available Input Columns starting with FirstName, as shown in Figure 10-26. Select the check box for “Remove rows with duplicate sort values” and click OK to close the editor. Rename the Sort transformation Removing Exact Duplicates by Sort Key. 10. Add a Lookup transformation to the data flow and join the Sort transformation by dragging the green arrow onto it. 11. Double-click the Lookup transformation to open the Lookup Transformation Editor. Select “Redirect rows to no match output” option in the “Specify how to handle rows with no matching entries” field. Click the Connection page link in the pane. 12. Add the localhost.Campaign connection manager using the New button next to the OLE DB connection manager field. Then select [dbo].[owner] table from the drop-down list in the “Use a table or a view” field. 13. Go to the Columns tab, and map all the Available Input Columns to the Available Lookup Columns for the matching names as shown in Figure 10-27. Click to select the OwnerID column in the Available Lookup Columns to add it as a new column in the output. In this transformation, you are matching input records with the records in the Owner table, and for the matching record, you are adding OwnerID in the output to capture the ID of the record with which a match has been found. Click OK to close this editor. Rename the Lookup transformation Removing Exact Duplicates by Lookup. 14. Drop an Excel destination just below Removing Exact Duplicates By Lookup and join it to the Excel Destination. You will be asked to select an Output in the Input Output Selection dialog box. Select Lookup Match Output to connect to Excel Destination. Rename this Excel Destination Exact Duplicates. 15. Double-click the Excel Destination to configure it. Choose DuplicateOwners Connection in the OLE DB Connection Manager field if it’s not already selected. Chapter 10: Data Flow Transformations 471 16. Click the New button next to the Name of the Excel sheet field and verify the CREATE TABLE statement that it is creating an Exact Duplicates table—i.e., a worksheet in Excel. Click OK to accept. Select Exact_Duplicates in this field. Go to the Mappings page to create the necessary column mappings that happen automatically. Click OK to close this editor. Figure 10-26 Removing exact duplicates using a Sort transformation 472 Hands-On Microsoft SQL Server 2008 Integration Services Exercise (Remove Fuzzy Duplicates) You’ve extracted exact duplicates from the Lookup transformation to an Excel destination. Now you will redirect the remaining rows to a Fuzzy Lookup and extract the duplicates using fuzzy match logic. You will be removing all the variants of Johnathon Skinner. 17. Drop a Fuzzy Lookup transformation onto the Data Flow surface to the right of the Removing Exact Duplicates by Lookup component. Click Removing Exact Duplicates by Lookup and drag and drop the second green arrow, which is Figure 10-27 Removing exact duplicates using the Lookup transformation Chapter 10: Data Flow Transformations 473 actually the Lookup No Match Output, onto the Fuzzy Lookup transformation. You have redirected all the remaining rows that did not match in the exact lookup operation to the Fuzzy Lookup transformation. 18. Double-click the Fuzzy Lookup transformation and make sure the Generate New Index radio button is selected. Select the [dbo].[Owner] table from the drop-down list in the Reference Table Name field. Go to the Columns tab. 19. Select the check box of the OwnerID column in the Available Lookup Columns to add it in the fuzzy matched records, as shown in Figure 10-28. 20. Go to the Advanced page and set the Maximum number of matches to output per lookup equal to 3. Slide the Similarity Threshold slider to 0.30. Click OK Figure 10-28 Identifying fuzzy matched records with Fuzzy Lookup 474 Hands-On Microsoft SQL Server 2008 Integration Services to close the Transformation Editor. Rename this transformation as Identifying Duplicates by Fuzzy Lookup. 21. Drop the Conditional Split transformation just below the Identifying Duplicates by Fuzzy Lookup transformation and connect both the components using a green data flow path. Double-click the Conditional Split to edit it. You need to identify the duplicates using _Similarity and _Confidence columns added by the Fuzzy Lookup transformation in the pipeline. You may have to run your package several times before you can get workable values for this identification. Here, I’ve already done the work for this data to identify these values; add the following in the Condition field: _Similarity >= 0.60 && _Confidence >= 0.50 Type Fuzzy Lookup Matches in the Output Name field and Remaining Owners in the Default Output Name field, as shown in Figure 10-29. Click OK to close the editor. Rename this transformation Splitting Fuzzy Matched Duplicates. 22. Drop an Excel Destination below the Conditional Split transformation and join both of these components using a green arrow. Select Fuzzy Lookup Matches in the Output field of the Input Output Selection dialog box. Rename this Excel Destination Fuzzy Matched Duplicates. 23. Double-click the Excel Destination to configure it. Select DuplicateOwners Connection in the OLE DB Connection Manager field if it is not already selected. 24. Click the New button next to the Name of the Excel sheet field, verify in the CREATE TABLE statement that is creating the Fuzzy Matched Duplicates table—i.e., the worksheet in Excel—click OK to accept, and then select the Fuzzy_Matched_Duplicates in this field. 25. Go to the Mappings page to create necessary column mappings automatically. Note that in addition to the _Similarity and _Confidence columns, the Fuzzy Lookup transformation has added one _Similarity_ColumnName column for each column that participated in the fuzzy match. Click OK to close this editor. Exercise (Remove Duplicates by Fuzzy Grouping) Here you will remove duplicate variants of the Kathrine Morris. These records do not exist in the reference table but appear multiple times in the input file in variant forms. You will capture the Remaining Owners from Split transformation and use a fuzzy grouping to identify canonical row and the likely duplicates. 26. Drop Fuzzy Grouping in the pipeline to the right of the Splitting Fuzzy Duplicates component. Click Splitting Fuzzy Duplicates and then drag the available green arrow from the Splitting Fuzzy Duplicates and drop it on the Fuzzy Grouping transformation. Chapter 10: Data Flow Transformations 475 27. Double-click the Fuzzy Grouping transformation and verify that the localhost. Campaign is selected in the OLE DB Connection Manager field. Go to the Columns tab. Note that all the Available Input Columns have Pass Through check boxes selected. Uncheck all of them. Select only the columns that are coming from OwnersFeed—i.e., from FirstName to Country, as shown in Figure 10-30. As you select these columns, a line corresponding to each selected column will be added in the grid. Type 0.25 in the Minimum Similarity column for the FirstName and LastName rows. Figure 10-29 Splitting fuzzy matched records from the Data Flow. 476 Hands-On Microsoft SQL Server 2008 Integration Services 28. Go to the Advanced tab and set the Similarity threshold to 0.50 using the slider. Click OK to close this editor. Rename the Fuzzy Grouping transformation Identifying Duplicates by Fuzzy Grouping. 29. Drop the Conditional Split transformation just below the Identifying Duplicates by Fuzzy Grouping transformation and connect both the components using a green data flow path. Double-click the Conditional Split to edit it. You need to split the duplicates using _key_in and _key_out columns added by the Fuzzy Figure 10-30 Identifying and grouping duplicates using the Fuzzy Grouping transformation Chapter 10: Data Flow Transformations 477 Grouping transformation in the pipeline. Add the two rows in the grid by adding the following in the Condition field: _key_in == _key_out _key_in != _key_out Type Canonical Row in the first row and Fuzzy Grouped Matches in the second row of Output Name field, as shown in Figure 10-31. Click OK to close the editor. Rename this transformation Splitting Fuzzy Grouped Duplicates. Figure 10-31 Splitting fuzzy grouped and unique records in the data flow . 468 Hands-On Microsoft SQL Server 2008 Integration Services Hands-On: Removing Duplicates from Owners Data The Fuzzy Lookup and. matched records from the Data Flow. 476 Hands-On Microsoft SQL Server 2008 Integration Services 28. Go to the Advanced tab and set the Similarity threshold to 0 .50 using the slider. Click OK to close. editor. Figure 10-26 Removing exact duplicates using a Sort transformation 472 Hands-On Microsoft SQL Server 2008 Integration Services Exercise (Remove Fuzzy Duplicates) You’ve extracted exact duplicates